Abstract

One of the interesting and important problems of information diffusion over a large social net- 
work is to identify an appropriate model from a limited amount of diffusion information. There 
are two contrasting approaches to model information diffusion. One is a push type model, known 
as Independent Cascade (IC) model and the other is a pull type model, known as Linear Thresh- 
old (LT) model. We extend these two models (called AsIC and AsLT in this paper) to incorporate 
asynchronous time delay and investigate 1) how they differ from or resemble each other in terms of
information diffusion, 2) whether the models themselves are learnable from the observed information
diffusion data, and 3) which model is more appropriate for explaining how a particular topic
(information) diffuses/propagates. We first show that there can be variations with respect to how the time
delay is modeled, and derive the likelihood of the observed data being generated for each model.
Using one particular time delay model, we show that the model parameters are learnable from a
limited amount of observation. We then propose a method based on predictive accuracy by which to
select a model which better explains the observed data. Extensive evaluations were performed us- 
ing both synthetic data and real data. We first show using synthetic data with the network structures 
taken from four real networks that there are considerable behavioral differences between the AsIC 
and the AsLT models, the proposed methods accurately and stably learn the model parameters, and 
identify the correct diffusion model from a limited amount of observation data. We next apply these 
methods to behavioral analysis of topic propagation using the real blog propagation data, and show 
that there is a clear indication as to which topic better follows which model although the results are 
rather insensitive to the model selected at the level of discussing how far and fast each topic prop- 
agates from the learned parameter values. The correspondence between the topic and the model 
selected is well interpretable considering such factors as urgency, popularity and people's habit. 
1. Introduction

The growth of the Internet has enabled the formation of various kinds of large-scale social networks, through
which a variety of information including innovation, hot topics and even malicious rumors can 
be propagated in the form of so-called "word-of-mouth" communications. Social networks are 
now recognized as an important medium for the spread of information, and a considerable num- 
ber of studies have been made (Newman, Forrest, & Balthrop, 2002; Newman, 2003; Gruhl, 
Guha, Liben-Nowell, & Tomkins, 2004; Domingos, 2005; Leskovec, Adamic, & Huberman, 2006; 
Romero, Meeder, & Kleinberg, 2011; Bakshy, Hofman, Mason, & Watts, 2011; Mathioudakis, 
Bonchi, Castillo, Gionis, & Ukkonen, 2011).

Widely used information diffusion models in these studies are the independent cascade (IC)
(Goldenberg, Libai, & Muller, 2001; Kempe, Kleinberg, & Tardos, 2003; Kimura, Saito, & Motoda,
2009) and the linear threshold (LT) (Watts, 2002; Watts & Dodds, 2007). They have been used to 
solve such problems as the influence maximization problem (Kempe et al., 2003; Chen, Wang, & 
Yang, 2009; Kimura, Saito, Nakano, & Motoda, 2010) and the contamination minimization prob- 
lem (Kimura et al., 2009). These two models assume different mechanisms for information diffusion 
which are based on two opposite views. In the IC model each active node independently influences
its inactive neighbors with given diffusion probabilities (information push style model). In the LT
model a node is influenced by its active neighbors if their total weight exceeds the threshold for the
node (information pull style model). Which model is more appropriate depends on the situation, and
selecting the appropriate one for a particular problem is an interesting and important problem. To
address it, we first have to understand the behavioral difference between these two
models.

Both models have parameters that need to be specified in advance: diffusion probabilities for the IC
model, and weights for the LT model. However, their true values are not known in practice, which
poses a challenging problem of estimating them from a limited amount of information diffusion
data that are observed as time-sequences of influenced (activated) nodes. Fortunately, this is a
well-defined parameter estimation problem in a machine learning setting. Given a generative model
with its parameters and independent observed data, we can calculate the likelihood that the
data are generated and can estimate the parameters by maximizing the likelihood. This approach
has a thorough theoretical background. The way the parameters are estimated depends on how the 
generative model is given. To the best of our knowledge, we were the first to follow this line of 
research. We addressed this problem first for the basic IC model (Saito, Nakano, & Kimura, 2008; 
Kimura, Saito, Nakano, & Motoda, 2009) and then its variant that incorporates asynchronous time 
delay (referred to as the AsIC model) (Saito, Kimura, Ohara, & Motoda, 2009). We further applied
this to a variant of the LT model that also incorporates asynchronous time delay (referred to as the
AsLT model) (Saito, Kimura, Ohara, & Motoda, 2010a; Saito, Kimura, Ohara, & Motoda, 2010c).

Gruhl et al. (2004) also tackled the same problem of estimating the parameters and proposed
an EM-like algorithm, but they did not formalize the likelihood and it is not clear what is being
optimized in deriving the parameter update formulas. Goyal, Bonchi, and Lakshmanan (2010)
attacked this problem from a different angle. They employed a variant of the LT model and
estimated the parameter values by four different methods, all of which are directly computed from the
frequency of the events in the observed data. Their approach is efficient, but it is rather ad
hoc and lacks theoretical grounding. Bakshy, Karrer, and Adamic (2009) addressed the problem
of diffusion of user-created content (asset) and used the maximum likelihood method to estimate






Learning Asynchronous-Time Information Diffusion Models 



the rate of asset adoption. However, they only modeled the rate of adoption and did not consider 
the diffusion model itself. Their focus was data analysis. Gomez-Rodriguez, Leskovec, and Krause 
(2010) proposed an efficient method of inferring a network from the observed diffusion sequences 
based on the continuous time version of the IC model, assuming the probability that a node affects 
its child node is a function of the difference of the activation times between the two nodes. Their 
focus is inferring the structure of the network rather than inferring the best predictive model for a 
known network. They fixed a model and approximated the likelihood function in such a way that 
the simplified likelihood function can be maximized by adding a link in each iteration. Recent work
of Myers and Leskovec (2010) is close to ours. They used a model similar to but different in details
from the AsIC model and showed that the likelihood maximization problem can effectively be
transformed into a convex program for which a global solution is guaranteed (see footnote 1). Their focus was also
inferring the structure of the network. 

In this paper, we first detail the Asynchronous Independent Cascade Model and the Asyn- 
chronous Linear Threshold Model as two contrasting information diffusion models. Both are exten- 
sions of the basic Independent Cascade Model and Linear Threshold Model that incorporate time 
delay in an asynchronous way. In particular, we focus on the likelihood derivation of these models. We
show that there are a few variations of time delay and that different time delay models result in different
likelihood formulations. We then show for a particular time delay model how to obtain the parameter
values that maximize the respective likelihood by deriving an EM-like iterative approach using the
observed sequence data. Indeed, being able to cope with asynchronous time delay is indispensable 
to do realistic analysis of information diffusion because, in the real world, information propagates 
along the continuous time axis, and time-delays can occur during the propagation asynchronously. 
In fact, the time stamps of the observed data are not equally spaced. This means that the proposed 
learning method has to estimate not only the diffusion parameters (diffusion probabilities for the 
AsIC model and weights for the AsLT model) but also the time-delay parameters from the observed 
data. We identified that there are basically two types of delay: link delay and node delay. The 
former corresponds to the delay associated with information propagation, and the latter corresponds 
to the delay associated with human action which is further divided into two types: non-override 
and override. We choose link delay to explain the learning algorithms and perform the experiments 
on this model. For the other time delay models we only derive the likelihood functions that are re- 
quired for the learning algorithms. Incorporating time-delay makes the time-sequence observation 
data structural, which makes the analysis of diffusion process difficult because there is no way of 
knowing which node has activated which other node from the observation data sequence. 

Knowing the optimal parameter values does not mean that the observation follows the model 
well. We have to decide which model better explains the observation and select the right (or more 
appropriate) model. We solve this problem by comparing the predictive accuracy of each model. 
We use a variant of the hold-out method applied to a set of sequential data, which is similar to the
leave-one-out method applied to multiple time-sequence data, i.e., we use a part of the data, train
the model, predict the activation probability at one step later and compare it with the observation. 
We repeat this by changing the size of the training data. 

In summary, we want to 1) clarify how the AsIC model and the AsLT model differ from or
resemble each other in terms of information diffusion, 2) propose a method to learn the model
parameters from a limited number of observed data and show that the method is effective, and 3)
show that how information diffuses depends on the topic and that the proposed method can identify
which model is more appropriate for explaining how a particular topic (information) diffuses/propagates.

1. We discuss the difference between their model and our model in Section 7.

We have performed extensive experiments to verify the proposed approaches using both syn- 
thetic data and real data. Experiments using synthetic data generated by the models (AsIC and 
AsLT) with network structures taken from four real networks revealed that there are considerable 
behavioral differences between the AsIC and the AsLT models, and that the differences can be explained
qualitatively by the diffusion mechanism. It is also shown that the proposed likelihood maximization
methods accurately and stably learn the model parameters, and identify the correct diffusion model
from a limited amount of observation data. Experiments of behavioral analysis of topic propaga- 
tion using the real blog data show that the results are rather insensitive to the model selected at 
an abstract level of discussing how relatively far and fast each topic propagates from the learned 
parameter values but still there is a clear indication as to which topic better follows which model. 
The correspondence between the topic and the model selected is well interpretable considering such 
factors as urgency, popularity and people's habit. 

The paper is organized as follows. In Section 2, we introduce the two contrasting information 
diffusion models (AsIC and AsLT) we use in this paper, and in Section 3, we detail how the
likelihood functions can be formulated for several variations of the time delay model and, in the Appendix,
how the parameters can be obtained using one particular model of time delay (link delay). In Section
4, we show the detailed analysis results of the behavioral difference between AsIC and AsLT obtained
by using four real network structures. In Section 5 we detail the learning performance (accuracy of
parameter learning and influential node ranking) using the synthetic data generated from the same four
real network structures. In Section 6 we focus on model selection using both synthetic data and real
blog network data. In Section 7 we discuss some of the important issues regarding the related work 
and those for future work. We end the paper by summarizing what has been achieved in Section 8. 

2. Information Diffusion Models 
2.1 Two Contrasting Diffusion Models 

It is quite natural to bring in the notion of information sender and receiver. The IC model is
sender-centered. It is motivated by epidemic spread in which the disease carrier is the information sender.
If a person gets infected, his or her neighbors also get infected, i.e., the information sender tries to
push information to its neighbors. The LT model is receiver-centered. It is based on the view that
the receiver has control over the information flow. This models the way innovation propagates.
For example, a person is tempted to buy a new tablet PC if many of his or her neighbors have
purchased it and said that it is good, i.e., the information receiver tries to pull information.

Both models have respective reasons for their working mechanisms, but they contrast sharply
with each other. We are interested in 1) how they differ from or resemble each other in terms of
information diffusion, 2) whether the models themselves are learnable from the observed information
diffusion data, and 3) which model is more appropriate for explaining how a particular topic
(information) diffuses/propagates. Both models have parameters, i.e., a diffusion probability attached to each
directional link in the IC model and a weight attached to each directional link in the LT model. As
shown later in Section 3.2, the weight is equivalent to a probability. Thus, intuitively both models
appear to be comparable in terms of the average influence degree if the parameter values are
comparable. The simulation results, however, show that these two models behave quite differently. We
will explain why they are different in Section 4.2. 
In the following two subsections we describe the two diffusion models that we use in this
paper: the asynchronous independent cascade (AsIC) model, first introduced by Saito et al. (2009),
and the asynchronous linear threshold (AsLT) model, first introduced by Saito et al. (2010a). They
differ from the basic IC and LT models in that they explicitly handle the time delay. The diffusion
process evolves with time. The basic models deal with time by allowing nodes to change their states
in a synchronous way at each discrete time step, i.e., no time delay is considered, or one can say that
every state change is uniformly delayed by exactly one discrete time step. Their asynchronous time
delay versions explicitly treat the time delay of each node independently. We discuss the notion of
time delay in more depth in Section 3.3.1.

The models we explain in the following two subsections and the learning algorithms we
describe in Section 3 are based on a particular time-delay model, which we call link delay. In this
model the time delay is caused by the communication channel, e.g., network traffic and/or some
malfunction, and as soon as the information arrives at the destination, the node responds without
delay.

Before we explain the models, we give the definition of a graph and of the children and parents of a
node. The graph we use is a directed graph $G = (V, E)$ without self-links, where $V$ and $E$ $(\subset V \times V)$
stand for the sets of all the nodes and links, respectively. For each node $v$ in the network $G$, we
denote by $F(v)$ the set of child nodes of $v$, i.e.,

$$F(v) = \{w \in V;\ (v, w) \in E\}.$$

Similarly, we denote by $B(v)$ the set of parent nodes of $v$, i.e.,

$$B(v) = \{u \in V;\ (u, v) \in E\}.$$

We call nodes active if they have been influenced with the information. In the following models, we 
assume that nodes can switch their states only from inactive to active, but not the other way around, 
and that, given an initial active node set S, only the nodes in S are active at an initial time. 
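
As an illustration of the definitions above, the child and parent sets can be built from an edge list as follows. This is a minimal sketch, not part of the original method; the function name `build_neighbors` and the edge-list representation are assumptions made for the example.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Return (F, B): child sets F[v] and parent sets B[v] of a directed graph.

    `edges` is an iterable of (u, v) pairs; self-links are rejected because
    the graph is assumed to have none.
    """
    F = defaultdict(set)  # F[v] = {w : (v, w) in E}
    B = defaultdict(set)  # B[v] = {u : (u, v) in E}
    for u, v in edges:
        if u == v:
            raise ValueError("self-links are not allowed")
        F[u].add(v)
        B[v].add(u)
    return F, B


edges = [("a", "b"), ("a", "c"), ("b", "c")]
F, B = build_neighbors(edges)
```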

2.2 Asynchronous Independent Cascade Model 

We first recall the definition of the IC model according to the work of Kempe et al. (2003), and then
introduce the AsIC model. In the IC model, we specify a real value $p_{u,v}$ with $0 < p_{u,v} < 1$ for each
link $(u, v)$ in advance. Here $p_{u,v}$ is referred to as the diffusion probability through link $(u, v)$. The
diffusion process unfolds in discrete time-steps $t \geq 0$, and proceeds from a given initial active set
$S$ in the following way. When a node $u$ becomes active at time-step $t$, it is given a single chance to
activate each currently inactive child node $v$, and succeeds with probability $p_{u,v}$. If $u$ succeeds, then
$v$ will become active at time-step $t + 1$. If multiple parent nodes of $v$ become active at time-step $t$,
then their activation attempts are sequenced in an arbitrary order, but all performed at time-step t. 
Whether or not u succeeds, it cannot make any further attempts to activate v in subsequent rounds. 
The process terminates if no more activations are possible. 
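
The discrete-time process described above can be sketched as a short simulation. This is only an illustrative sketch under stated assumptions: the function name `simulate_ic`, the dictionary-based graph (`children`) and the probability table `p` keyed by `(u, v)` are hypothetical choices for the example.

```python
import random


def simulate_ic(children, p, seeds, rng=None):
    """Run one IC diffusion and return {node: activation time-step}.

    children[u] lists the child nodes of u, p[(u, v)] is the diffusion
    probability of link (u, v), and seeds is the initial active set S.
    """
    rng = rng or random.Random()
    activated = {s: 0 for s in seeds}
    frontier = list(seeds)
    t = 0
    while frontier:
        new = {}
        for u in frontier:  # nodes that became active at time-step t
            for v in children.get(u, ()):
                if v in activated or v in new:
                    continue  # v is already active or has just been activated
                # u gets a single chance to activate v
                if rng.random() < p[(u, v)]:
                    new[v] = t + 1
        activated.update(new)
        frontier = list(new)
        t += 1
    return activated
```

Each node enters `frontier` exactly once, so every link is tried at most once, matching the single-chance rule.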

In the AsIC model, we specify real values $r_{u,v}$ with $r_{u,v} > 0$ in advance for each link $(u, v) \in E$
in addition to $p_{u,v}$, where $r_{u,v}$ is referred to as the time-delay parameter through link $(u, v)$. The
diffusion process unfolds in continuous time $t$, and proceeds from a given initial active set $S$ in the
following way. Suppose that a node $u$ becomes active at time $t$. Then, $u$ is given a single chance
to activate each currently inactive child node $v$. We choose a delay-time $\delta$ from the exponential
distribution (see footnote 2) with parameter $r_{u,v}$. If $v$ has not been activated before time $t + \delta$, then $u$ attempts
to activate $v$, and succeeds with probability $p_{u,v}$. If $u$ succeeds, then $v$ will become active at time
$t + \delta$. Said differently, among the parents that succeed in satisfying the activation condition,
the one whose activation time is the earliest, considering the time delay associated with each
link, actually activates the node. Under the continuous time framework, it is unlikely that $v$ is
activated simultaneously by its multiple parent nodes exactly at time $t + \delta$. So we do not consider
this possibility. Whether or not $u$ succeeds, it cannot make any further attempts to activate $v$ in
subsequent rounds. The process terminates if no more activations are possible.

2. Similar formulation can be derived for other distributions such as power-law and Weibull.
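
The AsIC process with link delay lends itself to an event-driven simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the name `simulate_asic`, the dictionaries `children`, `p` and `r` keyed as shown, and the use of a priority queue are illustrative choices. The success trial is drawn when the parent becomes active rather than at arrival time, which gives the same distribution because the trial is independent of the delay.

```python
import heapq
import random


def simulate_asic(children, p, r, seeds, t0=0.0, rng=None):
    """Run one AsIC diffusion (link delay); return {node: activation time}."""
    rng = rng or random.Random()
    act_time = {}
    # Heap entries are (arrival time, tie-breaker, target node).
    heap = [(t0, i, s) for i, s in enumerate(seeds)]
    heapq.heapify(heap)
    counter = len(heap)
    while heap:
        t, _, v = heapq.heappop(heap)
        if v in act_time:
            continue  # v was already activated by an earlier arrival
        act_time[v] = t
        for w in children.get(v, ()):
            if w in act_time:
                continue
            # Single chance: succeed with probability p, then wait Exp(r).
            if rng.random() < p[(v, w)]:
                delay = rng.expovariate(r[(v, w)])
                heapq.heappush(heap, (t + delay, counter, w))
                counter += 1
    return act_time
```

Because arrivals are processed in time order, the earliest successful attempt is the one that activates a node, and later arrivals are discarded.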

2.3 Asynchronous Linear Threshold Model 

As above, we first recall the LT model. In this model, for every node $v \in V$, we specify a
weight $q_{u,v}$ $(> 0)$ from its parent node $u$ in advance such that

$$\sum_{u \in B(v)} q_{u,v} \leq 1.$$

The diffusion process from a given initial active set $S$ proceeds according to the following
randomized rule. First, for any node $v \in V$, a threshold $\theta_v$ is chosen uniformly at random from the interval
$[0, 1]$. At time-step $t$, an inactive node $v$ is influenced by each of its active parent nodes, $u$, according
to weight $q_{u,v}$. If the total weight from active parent nodes of $v$ is no less than $\theta_v$, that is,

$$\sum_{u \in B_t(v)} q_{u,v} \geq \theta_v,$$

then $v$ will become active at time-step $t + 1$. Here, $B_t(v)$ stands for the set of all the parent nodes
of $v$ that are active at time-step $t$. The process terminates if no more activations are possible.
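
The randomized threshold rule above can be sketched as follows. This is an illustrative sketch only; the function name `simulate_lt` and the dictionaries `parents` and `q` are assumptions made for the example.

```python
import random


def simulate_lt(parents, q, nodes, seeds, rng=None):
    """Run one LT diffusion and return {node: activation time-step}.

    parents[v] lists the parent nodes of v and q[(u, v)] is the weight of
    link (u, v); the weights into each node are assumed to sum to at most 1.
    """
    rng = rng or random.Random()
    theta = {v: rng.random() for v in nodes}  # thresholds, uniform on [0, 1)
    activated = {s: 0 for s in seeds}
    t = 0
    while True:
        new = {}
        for v in nodes:
            if v in activated:
                continue
            active = [u for u in parents.get(v, ()) if u in activated]
            # v becomes active when the weight of its active parents
            # is no less than its threshold
            if active and sum(q[(u, v)] for u in active) >= theta[v]:
                new[v] = t + 1
        if not new:
            return activated
        activated.update(new)
        t += 1
```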

The AsLT model is defined in a similar way to the AsIC. In the AsLT model, in addition to
the weight set $\{q_{u,v}\}$, we specify real values $r_{u,v}$ with $r_{u,v} > 0$ in advance for each link $(u, v)$.
As for AsIC, we refer to $r_{u,v}$ as the time-delay parameter through link $(u, v)$. The diffusion
process unfolds in continuous time $t$, and proceeds from a given initial active set $S$ in the following
way. Each active parent $u$ of the node $v$ exerts its effect on $v$ with the time delay $\delta$ drawn from the
exponential distribution with the delay parameter $r_{u,v}$. Suppose that the accumulated weight from
the active parents of node $v$ has become no less than $\theta_v$ at time $t$ for the first time. Then, the node $v$
becomes active at $t$ without any delay and exerts its effect on its children with a delay associated with
each link. This process is repeated until no more activations are possible.
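
A continuous-time simulation of the AsLT process with link delay can be sketched along the same event-driven lines. Again this is a hedged sketch, not the authors' code: `simulate_aslt` and its argument layout are hypothetical.

```python
import heapq
import random


def simulate_aslt(children, q, r, nodes, seeds, t0=0.0, rng=None):
    """Run one AsLT diffusion (link delay); return {node: activation time}."""
    rng = rng or random.Random()
    theta = {v: rng.random() for v in nodes}  # thresholds, uniform on [0, 1)
    weight = {v: 0.0 for v in nodes}          # accumulated weight per node
    act_time = {}
    heap = []  # entries: (arrival time, tie-breaker, parent, child)
    counter = 0

    def activate(v, t):
        nonlocal counter
        act_time[v] = t
        # v exerts its effect on each inactive child after an Exp(r) delay
        for w in children.get(v, ()):
            if w not in act_time:
                delay = rng.expovariate(r[(v, w)])
                heapq.heappush(heap, (t + delay, counter, v, w))
                counter += 1

    for s in seeds:
        activate(s, t0)
    while heap:
        t, _, u, v = heapq.heappop(heap)
        if v in act_time:
            continue
        weight[v] += q[(u, v)]
        if weight[v] >= theta[v]:
            activate(v, t)  # threshold reached for the first time at t
    return act_time
```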

3. Learning Algorithms 

We define the diffusion parameter vector $p$ and the time-delay parameter vector $r$ by

$$p = (p_{u,v})_{(u,v) \in E}, \quad r = (r_{u,v})_{(u,v) \in E}$$

for the AsIC model, and the weight parameter vector $q$ and the time-delay parameter vector $r$ by

$$q = (q_{u,v})_{(u,v) \in E}, \quad r = (r_{u,v})_{(u,v) \in E}$$

for the AsLT model. We next consider an observed data set of $M$ independent information diffusion
results,

$$\{D_m;\ m = 1, \cdots, M\}.$$
Here, each $D_m$ is a set of pairs of an active node and its activation time in the $m$-th diffusion result,

$$D_m = \{(u, t_{m,u}), (v, t_{m,v}), \cdots\}.$$

We denote by $t_{m,v}$ the activation time of node $v$ for the $m$-th diffusion result. For each $D_m$, we
denote the observed initial time by

$$t_m = \min\{t_{m,v};\ (v, t_{m,v}) \in D_m\},$$

and the observed final time by

$$T_m \geq \max\{t_{m,v};\ (v, t_{m,v}) \in D_m\}.$$

Note that $T_m$ is not necessarily equal to the final activation time. Hereafter, we express our
observation data by

$$\mathcal{D}_M = \{(D_m, T_m);\ m = 1, \cdots, M\}.$$

For any $t \in [t_m, T_m]$, we set

$$C_m(t) = \{v \in V;\ (v, t_{m,v}) \in D_m,\ t_{m,v} < t\}.$$

Namely, $C_m(t)$ is the set of active nodes before time $t$ in the $m$-th diffusion result. For convenience,
we use $C_m$ to refer to the set of all the active nodes in the $m$-th diffusion result, i.e.,

$$C_m = \bigcup_{t \in [t_m, T_m]} C_m(t).$$

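Moreover, we define a set of non-active nodes with at least one active parent node for each $D_m$ by

$$\partial C_m = \{v \in V;\ (u, v) \in E,\ u \in C_m,\ v \notin C_m\}.$$

For each node $v \in C_m \cup \partial C_m$, we define the following subset of parent nodes, each of which had
a chance to activate $v$:

$$B_{m,v} = \begin{cases} B(v) \cap C_m(t_{m,v}) & \text{if } v \in C_m, \\ B(v) \cap C_m & \text{if } v \in \partial C_m. \end{cases}$$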
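
The sets defined above can be computed directly from one observed diffusion result. The sketch below is illustrative only; `observation_sets` and the representation of a diffusion result as a dictionary from node to activation time are assumptions made for the example.

```python
def observation_sets(D_m, parents, children):
    """Compute the active set, its boundary and the candidate-parent sets.

    D_m maps each active node to its activation time; parents[v] and
    children[v] list the parent and child nodes of v.
    Returns (C_m, boundary, B_m), where B_m[v] is the set of parents that
    had a chance to activate v.
    """
    C_m = set(D_m)
    # Inactive nodes with at least one active parent
    boundary = {w for v in C_m for w in children.get(v, ()) if w not in C_m}
    B_m = {}
    for v in C_m:
        # Parents that were active strictly before v's activation time
        B_m[v] = {u for u in parents.get(v, ()) if u in D_m and D_m[u] < D_m[v]}
    for v in boundary:
        B_m[v] = {u for u in parents.get(v, ()) if u in C_m}
    return C_m, boundary, B_m
```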



Note that the underlying model behind the observed data is not available in reality. Thus, we 
investigate how the model affects the information diffusion results, and consider selecting a model 
which better explains the given observed data from the candidates, i.e., AsIC and AsLT models. To 
this end, we first have to estimate the values of r and p for the AsIC model, and the values of q and 
r for the AsLT model for the given $\mathcal{D}_M$.



3.1 Learning Parameters of AsIC Model 

First, we propose a method of learning the model parameters from the observed data for the AsIC
model. To estimate the values of $r$ and $p$ from $\mathcal{D}_M$ for the AsIC model, we derive the likelihood
function $\mathcal{L}(r, p; \mathcal{D}_M)$ to use as the objective function.

First, for the $m$-th information diffusion result, we consider any node $v \in C_m$ with $t_{m,v} > t_m$,
and derive the probability density $h_{m,v}$ that the node $v$ is activated at time $t_{m,v}$. Note that $h_{m,v} = 1$
if $t_{m,v} = t_m$. Let $\mathcal{X}_{m,u,v}$ denote the probability density that a node $u \in B_{m,v}$ activates the node $v$ at
time $t_{m,v}$, that is,

$$\mathcal{X}_{m,u,v} = p_{u,v}\, r_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})). \qquad (1)$$

Let $\mathcal{Y}_{m,u,v}$ denote the probability that the node $v$ is not activated by a node $u \in B_{m,v}$ within the
time-period $[t_{m,u}, t_{m,v}]$, that is,

$$\mathcal{Y}_{m,u,v} = 1 - p_{u,v} \int_{t_{m,u}}^{t_{m,v}} r_{u,v} \exp(-r_{u,v}(t - t_{m,u}))\, dt = p_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})) + (1 - p_{u,v}). \qquad (2)$$

If there exist multiple active parents for the node $v$, i.e., $|B_{m,v}| > 1$, we need to consider possibilities
that each parent node succeeds in activating $v$ at time $t_{m,v}$. However, in the case of the continuous time
delay model, we do not have to consider simultaneous activations by multiple active parents due to
the continuous property. Here, for any $u \in B_{m,v}$, let $h_{m,v}(u)$ be the probability density that the
node $u$ activates $v$ at time $t_{m,v}$ but all the other nodes $z$ in $B_{m,v}$ have failed in activating $v$ within
the time-period $[t_m, t_{m,v}]$ for the $m$-th information diffusion result. Then, we have

$$h_{m,v}(u) = \mathcal{X}_{m,u,v} \prod_{z \in B_{m,v} \setminus \{u\}} \mathcal{Y}_{m,z,v}.$$

Since the probability density $h_{m,v}$ is given by $h_{m,v} = \sum_{u \in B_{m,v}} h_{m,v}(u)$, we have

$$h_{m,v} = \sum_{u \in B_{m,v}} \mathcal{X}_{m,u,v} \prod_{z \in B_{m,v} \setminus \{u\}} \mathcal{Y}_{m,z,v}. \qquad (3)$$

Note that we are not able to know which node u actually activated the node v. This can be regarded 
as a hidden structure. 

Next, for the $m$-th information diffusion result, we consider any link $(v, w) \in E$ such that
$v \in C_m$ and $w \notin C_m$, and derive the probability $g_{m,v}$ that the node $v$ fails to activate its child
nodes. Note that $g_{m,v} = 1$ if $F(v) \setminus C_m = \emptyset$. Let $g_{m,v,w}$ denote the probability that the node $w$
is not activated by the node $v$ within the observed time period $[t_m, T_m]$. We can easily derive the
following equation:

$$g_{m,v,w} = p_{v,w} \exp(-r_{v,w}(T_m - t_{m,v})) + (1 - p_{v,w}). \qquad (4)$$

Here we can naturally assume that each information diffusion process finished sufficiently earlier
than the observed final time, i.e., $T_m \gg \max\{t_{m,v};\ (v, t_{m,v}) \in D_m\}$. Thus, as $T_m \to \infty$ in
Equation (4), we can assume

$$g_{m,v,w} = 1 - p_{v,w}. \qquad (5)$$

Therefore, the probability $g_{m,v}$ is given by

$$g_{m,v} = \prod_{w \in F(v) \setminus C_m} g_{m,v,w}. \qquad (6)$$
By using Equations (3) and (6), and the independence properties, we can define the likelihood
function $\mathcal{L}(r, p; \mathcal{D}_M)$ with respect to $r$ and $p$ by

$$\mathcal{L}(r, p; \mathcal{D}_M) = \prod_{m=1}^{M} \prod_{v \in C_m} \left( h_{m,v}\, g_{m,v} \right). \qquad (7)$$

In this paper, we focus on Equation (5) for simplicity, but we can easily modify our method 
to cope with the general one (i.e., Equation (4)). Thus, our problem is to obtain the values of r 
and p, which maximize Equation (7). For this estimation problem, we derive a method based on 
an iterative algorithm in order to stably obtain its solution. The details of the parameter update 
algorithm are given in Appendix A. 
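
To make the objective concrete, the logarithm of the AsIC likelihood under the simplification $g_{m,v,w} = 1 - p_{v,w}$ can be evaluated as sketched below. This only evaluates the objective; it is not the iterative update algorithm of Appendix A. The function name `asic_log_likelihood` and the data layout are assumptions made for the example (requires Python 3.8+ for `math.prod`).

```python
import math


def asic_log_likelihood(results, parents, children, p, r):
    """Log of the AsIC likelihood, using g_{m,v,w} = 1 - p_{v,w}.

    results is a list of dicts, each mapping an active node to its
    activation time; p and r are keyed by link (u, v).
    Assumes every non-initial active node has at least one earlier active
    parent; otherwise h = 0 and math.log raises ValueError.
    """
    total = 0.0
    for D in results:
        t_m = min(D.values())
        for v, t_v in D.items():
            if t_v > t_m:  # h_{m,v} = 1 for initially active nodes
                B = [u for u in parents.get(v, ()) if u in D and D[u] < t_v]
                decay = {u: math.exp(-r[u, v] * (t_v - D[u])) for u in B}
                X = {u: p[u, v] * r[u, v] * decay[u] for u in B}
                Y = {u: p[u, v] * decay[u] + (1.0 - p[u, v]) for u in B}
                h = sum(X[u] * math.prod(Y[z] for z in B if z != u) for u in B)
                total += math.log(h)
            # g_{m,v}: v failed to activate each child that stayed inactive
            for w in children.get(v, ()):
                if w not in D:
                    total += math.log(1.0 - p[v, w])
    return total
```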

3.2 Learning Parameters of AsLT Model 

Next, we propose a method of learning the model parameters from the observed data for the AsLT 
model. Similarly to the AsIC model, we first derive the likelihood function $\mathcal{L}(r, q; \mathcal{D}_M)$ with
respect to $r$ and $q$. For the sake of technical convenience, we introduce a slack weight $q_{v,v}$ for each
node $v \in V$ such that

$$q_{v,v} + \sum_{u \in B(v)} q_{u,v} = 1.$$

Here note that we can regard each weight as a multinomial probability since a threshold $\theta_v$ is
chosen uniformly at random from the interval $[0, 1]$ for each node $v$.

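First, for the $m$-th information diffusion result, we fix any node $v \in C_m$ with $t_{m,v} > t_m$, and
derive the probability density $h_{m,v}$ that the node $v$ is activated at time $t_{m,v}$. Note that $h_{m,v} = 1$
if $t_{m,v} = t_m$. Suppose any parent node $z \in B_{m,v}$ exerts its effect on $v$ with a delay $\delta_{z,v}$. Further
suppose that the threshold $\theta_v$ is first exceeded when the effect of $u \in B_{m,v}$ reaches $v$ after the delay
$\delta_{u,v}$. We define the subset $B_{m,v}(u)$ of $B_{m,v}$ by

$$B_{m,v}(u) = \{z \in B_{m,v};\ t_{m,z} + \delta_{z,v} < t_{m,u} + \delta_{u,v}\}.$$

Then, we have

$$\sum_{z \in B_{m,v}(u)} q_{z,v} < \theta_v \leq q_{u,v} + \sum_{z \in B_{m,v}(u)} q_{z,v}.$$

This implies that the probability that $\theta_v$ is chosen from this range is $q_{u,v}$. Let $\mathcal{X}_{m,u,v}$ denote the
probability density that node $u$ activates node $v$ at time $t_{m,v}$. Then, we have

$$\mathcal{X}_{m,u,v} = q_{u,v}\, r_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})). \qquad (8)$$

Since the probability density $h_{m,v}$ is given by $h_{m,v} = \sum_{u \in B_{m,v}} \mathcal{X}_{m,u,v}$, we have

$$h_{m,v} = \sum_{u \in B_{m,v}} q_{u,v}\, r_{u,v} \exp(-r_{u,v}(t_{m,v} - t_{m,u})). \qquad (9)$$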

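Next, for the $m$-th information diffusion result, we consider any node $v \in \partial C_m$, and derive
the probability $g_{m,v}$ that node $v$ is not activated within the observed time period $[t_m, T_m]$. We can
calculate $g_{m,v}$ as

$$\begin{aligned} g_{m,v} &= 1 - \sum_{u \in B_{m,v}} q_{u,v} \int_{t_{m,u}}^{T_m} r_{u,v} \exp(-r_{u,v}(t - t_{m,u}))\, dt \\ &= 1 - \sum_{u \in B_{m,v}} q_{u,v} \left( 1 - \exp(-r_{u,v}(T_m - t_{m,u})) \right) \\ &= q_{v,v} + \sum_{u \in B(v) \setminus B_{m,v}} q_{u,v} + \sum_{u \in B_{m,v}} q_{u,v} \exp(-r_{u,v}(T_m - t_{m,u})). \end{aligned} \qquad (10)$$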



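Therefore, by using Equations (9) and (10), and the independence properties, we can define the
likelihood function $\mathcal{L}(r, q; \mathcal{D}_M)$ with respect to $r$ and $q$ by

$$\mathcal{L}(r, q; \mathcal{D}_M) = \prod_{m=1}^{M} \left( \prod_{v \in C_m} h_{m,v} \right) \left( \prod_{v \in \partial C_m} g_{m,v} \right). \qquad (11)$$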

Thus, our problem is to obtain the time-delay parameter vector r and the weight parameter vector 
q, which together maximize Equation (11). The details of the parameter update algorithm are given 
in Appendix B. 
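
The AsLT objective can be evaluated in the same spirit, using the sum form of $h_{m,v}$ in Equation (9) and the second line of Equation (10) for $g_{m,v}$. The sketch below only computes the log-likelihood and is not the update algorithm of Appendix B; `aslt_log_likelihood` and the data layout are hypothetical.

```python
import math


def aslt_log_likelihood(results, parents, children, q, r):
    """Log of the AsLT likelihood for link delay.

    results is a list of (D, T) pairs: D maps each active node to its
    activation time and T is the observed final time; q and r are keyed
    by link (u, v).
    Assumes every non-initial active node has at least one earlier active
    parent; otherwise h = 0 and math.log raises ValueError.
    """
    total = 0.0
    for D, T in results:
        t_m = min(D.values())
        for v, t_v in D.items():
            if t_v > t_m:  # h_{m,v} = 1 for initially active nodes
                h = sum(
                    q[u, v] * r[u, v] * math.exp(-r[u, v] * (t_v - D[u]))
                    for u in parents.get(v, ())
                    if u in D and D[u] < t_v
                )
                total += math.log(h)
        # Inactive nodes with at least one active parent
        boundary = {w for v in D for w in children.get(v, ()) if w not in D}
        for v in boundary:
            g = 1.0 - sum(
                q[u, v] * (1.0 - math.exp(-r[u, v] * (T - D[u])))
                for u in parents.get(v, ())
                if u in D
            )
            total += math.log(g)
    return total
```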

3.3 Alternative Time-delay models 

In Section 2 we introduced one instance of time delay, i.e., link delay. In this subsection we discuss 
time delay phenomena in more depth for both the AsIC and the AsLT models.

3.3.1 Notion of Time-delay 

Each parent $u$ of a node $v$ can be activated independently of the other parents and, because the
associated time delay from a parent to its child is different for every single pair, which parent $u$
actually affects the node $v$ in which order is more or less opportunistic.

To explicate the information diffusion process in a more realistic setting, we consider two
examples, one associated with blog posting and the other associated with electronic mailing. In the case
of blog posting, assume that some blogger $u$ posts an article. Then it is natural to think that it takes
some time before another blogger $v$ comes to notice the posting. It is also natural to think that if
the blogger $v$ reads the article, he or she takes an action to respond (is activated) because the act of
reading the article is an active behavior. In this case, we can think that there is a delay in information
diffusion from $u$ to $v$ (from $u$'s posting to $v$'s reading) but there is no delay in $v$ taking an action
(from $v$'s reading to $v$'s posting). In the case of electronic mailing, assume that someone $u$ sends a mail
to someone else $v$. It is natural to think that the mail is delivered to the receiver $v$ instantaneously.
However, this does not necessarily mean that $v$ reads the mail as soon as it has been received
because the act of receiving a mail is a passive behavior. In this case, we can think that there is no
delay in information diffusion from $u$ to $v$ ($u$'s sending and $v$'s receiving) but there is a delay in $v$
taking an action (from $v$'s receiving to $v$'s sending). Further, when $v$ notices the mail, $v$ may think
to respond to it later. But before $v$ responds, a new mail may arrive which needs a prompt response
and $v$ sends a mail immediately. We can think of this as an update of acting time (see footnote 3). These are just

3. Note that there are two actions here, reading and sending, but the activation time in the observed sequence data 
corresponds to the time v sends a mail. 




two examples, but it appears worth distinguishing the difference of these two kinds of time delay
and update scheme (override of decision) in a more general setting. 

In view of the discussion above, we define two types of delay: link delay and node delay. It 
is easiest to think that link delay corresponds to propagation delay and node delay corresponds to 
action delay. We further assume that they are mutually exclusive. This is a strong restriction as well 
as a strong simplification by necessity because the activation time of a node we can observe is a sum 
of the activation time of its parent node and the two delays and we cannot distinguish between these 
two delays. Thus we have to choose either one of the two as occurring exclusively for the likelihood 
maximization to be feasible. In addition, in case of node delay there are two types of activation: 
non-override and override. The former sticks to the initial decision when to activate and the latter 
can decide to update (override) the time of activation multiple times to the earliest possible each 
time one of the parents gets newly activated. In summary, node delay can go with either override or 
non-override, and link delay can only go with non-override. 

Since we have already derived the likelihood function for link delay, here we consider the
likelihood function for node delay. In this case, the time delay parameter vector $r$ is expressed as
$r = (r_v)_{v \in V}$. The likelihood function $\mathcal{L}(r, p; \mathcal{D}_M)$ for the AsIC in the case of node delay is given
by Equation (7), where $h_{m,v}$ is the probability density that node $v$ is activated at time $t_{m,v}$ for the
$m$-th information diffusion result, and $g_{m,v}$ is the probability that node $v$ does not activate its child nodes within
the observed time period $[t_m, T_m]$ for the $m$-th information diffusion result. Note that $g_{m,v}$ remains the same
as in the case of link delay (see Equations (5) and (6)). The likelihood function $\mathcal{L}(r, q; \mathcal{D}_M)$ for the
AsLT in the case of node delay is given by Equation (11), where the definition of $h_{m,v}$ is the same
as above, and $g_{m,v}$ is the probability that the node $v$ is not activated within the observed time period
$[t_m, T_m]$ for the $m$-th information diffusion result. Note also that $g_{m,v}$ remains the same as in the case of link
delay (see Equation (10)). Therefore, our task now is to fix any node $v \in C_m$ with $t_{m,v} > t_m$, and
present the probability density $h_{m,v}$ that node $v$ is activated at time $t_{m,v}$ for the $m$-th information
diffusion result in the case of node delay. Here for simplicity, we order the active parent nodes $u \in B_{m,v}$ of
node $v$ according to the time $t_{m,u}$ at which each was activated, and set

3.3.2 Alternative Asynchronous Independent Cascade Model 

First, we derive $h_{m,v}$ for node delay with non-override and $h_{m,v}$ for node delay with
override in the case of the AsIC model.

Node delay with non-override. There is no delay in propagating the information to the node $v$
from the node $u$, but there is a delay $\delta$ before the node $v$ actually gets activated. Assume
that it is the node $u_j$ that first succeeded in activating the node $v$ (more precisely, in
satisfying the activation condition). Since there is no link delay and no override, all the other
parents that had become active before $u_j$ must have failed in activating $v$ (more precisely, in
satisfying the activation condition). Since the node $v$ decides when to actually activate itself at
the time the node $u_j$ succeeds in satisfying the activation condition and does not change its
mind, other nodes that were activated after $u_j$ can have no effect on the node $v$. Thus, the
probability density $h_{m,v}$ is given by

$$h_{m,v} = \sum_{j=1}^{J} \mathcal{X}_{m,u_j,v} \prod_{i=1}^{j-1} (1 - p_{u_i,v}),$$

where $\mathcal{X}_{m,u_j,v}$ is the probability density that node $u_j$ activates node $v$ at time
$t_{m,v}$, and is obtained by

$$\mathcal{X}_{m,u_j,v} = p_{u_j,v}\, r_v \exp(-r_v (t_{m,v} - t_{m,u_j})) \qquad (12)$$

(see Equation (1)). Note that in comparison to Equation (3), the probability
$\mathcal{Y}_{m,u_i,v}$ is replaced by $(1 - p_{u_i,v})$.

Node delay with override. In this case the actual activation time is allowed to be updated. For
example, suppose that the node $u_i$ first succeeded in satisfying the activation condition of the
node $v$ and the node $v$ decided to activate itself at time $t_{m,u_i} + \delta_i$. At some later
time, but before $t_{m,u_i} + \delta_i$, another parent $u_j$ also succeeded in satisfying the
activation condition of the node $v$. Then the node $v$ is allowed to change its actual activation
time to $t_{m,u_j} + \delta_j$ if it is before $t_{m,u_i} + \delta_i$. Thus, the probability density
$h_{m,v}$ is given by

$$h_{m,v} = \sum_{j=1}^{J} \mathcal{X}_{m,u_j,v} \prod_{i=1,\, i \neq j}^{J} \mathcal{Y}_{m,u_i,v}.$$

Here, $\mathcal{X}_{m,u_j,v}$ is the probability density that node $u_j$ activates node $v$ at time
$t_{m,v}$, and is obtained by Equation (12). Also, $\mathcal{Y}_{m,u_i,v}$ is the probability that
node $v$ is not activated by node $u_i$ within the time period $[t_{m,u_i}, t_{m,v}]$, and is
obtained by

$$\mathcal{Y}_{m,u_i,v} = p_{u_i,v} \exp(-r_v (t_{m,v} - t_{m,u_i})) + (1 - p_{u_i,v})$$

(see Equation (2)). Note that this formula for $h_{m,v}$ is equivalent to Equation (3) except that
the parameter $r_{u,v}$ is replaced by $r_v$.
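The two densities for the AsIC model with node delay can be evaluated numerically as in the following minimal sketch. It assumes that the parents are already sorted by activation time and that all of them were activated before $t_{m,v}$; the function and argument names are illustrative and are not taken from the original learning algorithm.

```python
import math
from typing import Sequence

def x_density(p_u: float, r_v: float, t_v: float, t_u: float) -> float:
    """Density that a parent active at t_u activates v at t_v (exponential node delay)."""
    return p_u * r_v * math.exp(-r_v * (t_v - t_u))

def y_prob(p_u: float, r_v: float, t_v: float, t_u: float) -> float:
    """Probability that v is not activated by this parent within [t_u, t_v]."""
    return p_u * math.exp(-r_v * (t_v - t_u)) + (1.0 - p_u)

def h_asic_non_override(p: Sequence[float], t_parents: Sequence[float],
                        r_v: float, t_v: float) -> float:
    """Sum over the first successful parent u_j; all earlier parents must have failed."""
    total = 0.0
    fail_before = 1.0  # running product of (1 - p_{u_i,v}) for i < j
    for p_u, t_u in zip(p, t_parents):
        total += x_density(p_u, r_v, t_v, t_u) * fail_before
        fail_before *= 1.0 - p_u
    return total

def h_asic_override(p: Sequence[float], t_parents: Sequence[float],
                    r_v: float, t_v: float) -> float:
    """Sum over the parent u_j whose schedule fires at t_v; no other schedule fired earlier."""
    total = 0.0
    for j, (p_j, t_j) in enumerate(zip(p, t_parents)):
        others = 1.0
        for i, (p_i, t_i) in enumerate(zip(p, t_parents)):
            if i != j:
                others *= y_prob(p_i, r_v, t_v, t_i)
        total += x_density(p_j, r_v, t_v, t_j) * others
    return total

print(h_asic_non_override([0.3, 0.5], [0.0, 1.0], r_v=1.0, t_v=2.0))
print(h_asic_override([0.3, 0.5], [0.0, 1.0], r_v=1.0, t_v=2.0))
```

With a single parent the two functions coincide, which is a quick sanity check on the formulas.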

3.3.3 Alternative Asynchronous Linear Threshold Model 

Next, we derive $h_{m,v}$ for node delay with non-override and $h_{m,v}$ for node delay with
override in the case of the AsLT model.

Node delay with non-override. As soon as the parent node $u_i$ is activated, its effect is
immediately exerted on its child $v$. The delay depends on the node $v$'s choice. Suppose the node
$v$ first became activated by the $i$-th parent in the ordering by activation time $t_{m,u_i}$.
Then, by the same reasoning as in Section 3.2, the threshold $\theta_v$ is between
$\sum_{j=1}^{i-1} q_{u_j,v}$ and $\sum_{j=1}^{i-1} q_{u_j,v} + q_{u_i,v}$, and the probability
density $h_{m,v}$ can be expressed as

$$h_{m,v} = \sum_{j=1}^{J} \mathcal{X}_{m,u_j,v},$$

where $\mathcal{X}_{m,u_j,v}$ is the probability density that node $u_j$ activates node $v$ at time
$t_{m,v}$, and is obtained by Equation (8). Note that this formula is equivalent to Equation (9)
except that the parameter $r_{u,v}$ is replaced by $r_v$.

Node delay with override. Here, multiple updates of the activation time of the node $v$ are
allowed. Suppose that the node $v$'s threshold is first exceeded by receiving the effect of the
parent $u_j$. All the parents that become activated after that can still influence the updates.
Among these parents, let $u_i$ be the one that succeeded in activating the node $v$ and let
$\{u_c\}$ be the other parents that failed. Then, the probability density $\mathcal{X}_{m,u_j,v}$
that the node $v$ is activated at time $t_{m,v}$ by a node $u_i$ that was activated no earlier than
$u_j$, the parent at which the threshold is first exceeded, is given by

$$\mathcal{X}_{m,u_j,v} = q_{u_j,v} \sum_{i=j}^{J} r_v \exp(-r_v (t_{m,v} - t_{m,u_i}))
  \prod_{c=j,\, c \neq i}^{J} \exp(-r_v (t_{m,v} - t_{m,u_c}))$$
$$= q_{u_j,v}\, (J - j + 1)\, r_v \prod_{i=j}^{J} \exp(-r_v (t_{m,v} - t_{m,u_i})).$$

Thus, we obtain

$$h_{m,v} = \sum_{j=1}^{J} \mathcal{X}_{m,u_j,v}.$$

Note that this formula is substantially different from Equation (9).
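The derivation for the AsLT model with node delay can be checked numerically with the following minimal sketch. It assumes an exponential node delay with rate $r_v$ and parents sorted by activation time; indices are 0-based, so the factor $(J - j + 1)$ appears as `J - j`. All function names are illustrative.

```python
import math
from typing import Sequence

def h_aslt_non_override(q: Sequence[float], t_parents: Sequence[float],
                        r_v: float, t_v: float) -> float:
    """The threshold is first exceeded at u_j with probability q_j; v then waits one delay."""
    return sum(q_j * r_v * math.exp(-r_v * (t_v - t_j))
               for q_j, t_j in zip(q, t_parents))

def x_override_sum(j: int, q: Sequence[float], t_parents: Sequence[float],
                   r_v: float, t_v: float) -> float:
    """Explicit sum over the parent i >= j whose schedule fires first at t_v."""
    J = len(t_parents)
    total = 0.0
    for i in range(j, J):
        term = r_v * math.exp(-r_v * (t_v - t_parents[i]))
        for c in range(j, J):
            if c != i:
                term *= math.exp(-r_v * (t_v - t_parents[c]))
        total += term
    return q[j] * total

def x_override_closed(j: int, q: Sequence[float], t_parents: Sequence[float],
                      r_v: float, t_v: float) -> float:
    """Closed form: every term of the explicit sum equals r_v times the same product."""
    J = len(t_parents)
    prod = math.prod(math.exp(-r_v * (t_v - t)) for t in t_parents[j:])
    return q[j] * (J - j) * r_v * prod

def h_aslt_override(q: Sequence[float], t_parents: Sequence[float],
                    r_v: float, t_v: float) -> float:
    return sum(x_override_closed(j, q, t_parents, r_v, t_v)
               for j in range(len(t_parents)))

q, times = [0.2, 0.3, 0.1], [0.0, 0.5, 1.2]
print(h_aslt_non_override(q, times, r_v=0.8, t_v=2.0))
print(h_aslt_override(q, times, r_v=0.8, t_v=2.0))
```

The explicit sum and the closed form agree term by term, and with a single parent the override and non-override densities coincide.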
3.3.4 Summary of Different Time Delay Models 

We note that $h_{m,v}$ for link delay and for node delay with override is identical for the AsIC
model, and that $h_{m,v}$ for link delay and for node delay with non-override is identical for the
AsLT model, except for a minor notational difference in the time delay parameter $r$ in both. Thus,
there are basically two cases for each model. We do not show here how the different time delay
models affect diffusion phenomena.⁴ There are indeed some differences in the transient period
(roughly the first 10 to 30 time units, measured in units of the average time delay). The
difference becomes larger as the values of the diffusion parameters become larger, as expected. For
more details, see the work of Saito, Kimura, Ohara, and Motoda (2010b).

We showed the parameter learning algorithms only for the case of link delay, for both the AsIC and
the AsLT models, in the Appendix. It is straightforward to derive similar algorithms for the other
time delay models.

3.4 Assumptions Introduced in Parameter Setting 

The formulations so far assumed that the parameters ($p_{u,v}$, $q_{u,v}$ and $r_{u,v}$)⁵ that
appear in the AsIC and the AsLT models depend on the individual link $(u, v) \in E$. The number of
parameters is thus equal to the number of links, which is huge for any realistic social network.
This means that we would need a prohibitively large amount of observation data, passing through each
link at least several times, to obtain accurate estimates of these parameters that do not overfit
the data. This is not realistic, and we can introduce a few alternative simplifying assumptions to
avoid this overfitting problem.

The simplest one would be to assume that each of the parameters $p_{u,v}$, $q_{u,v}$ and $r_{u,v}$
is represented by a single variable for the whole network. For a diffusion probability, we assume a
uniform value $p_{u,v} = p$ for all links. For a weight, we assume a uniform coefficient $q$ such
that $q_{u,v} = q\,|B(v)|^{-1}$, i.e., the weight $q_{u,v}$ is proportional to the reciprocal of the
number of $v$'s parents. This is the simplest realization that satisfies the constraint
$\sum_{u \in B(v)} q_{u,v} \le 1$. As shown later in Section 6.3.2, this is a reasonable
approximation for discussing information diffusion for a specific topic. The next simplification
would be to divide $E$ (or $V$) into subsets $E_1, E_2, \ldots, E_L$ (or $V_1, V_2, \ldots, V_L$)
and assign the same value to each parameter within each subset. For example, we may divide the nodes
into two groups: those that strongly influence others and those that do not, or we may divide the
nodes into another two groups: those that are easily influenced by others and those that are not.
Links connecting these nodes can accordingly be divided into subsets. If there is some background
knowledge about the node grouping, our method can make the best use of it. Obtaining such background
knowledge is also an important research topic in knowledge discovery from social networks. Yet
another simplification, which looks more realistic, would be to focus on the attributes of each node,
assume that there is a generic dependency between the parameter values of a link and the attribute
values of the connected nodes, and learn this dependency rather than learn the parameter values
directly from the data. In Saito, Ohara, Yamagishi, Kimura, and Motoda (2011) we adopted this
approach assuming a particular class of attribute dependency, and confirmed that the dependency can
be correctly learned even if the number of parameters is several tens of thousands. Learning a
function is much more realistic and does not require such a huge amount of data. In this way the
parameters can take different values for each link (or node).

4. Note that the difference among the time delay models vanishes when an equilibrium is reached.

5. To be more precise, we assumed that $r_{u,v} = r_v$ in the case of node delay. The simplification
in this case can also be made accordingly.

4. Behavioral Difference between the AsIC and the AsLT Models 
4.1 Data Sets and Parameter Setting 

We employed four datasets of large real networks (all bidirectionally connected). The first one is
a trackback network of Japanese blogs used by Kimura et al. (2009) and has 12,047 nodes and
79,920 directed links (the blog network). The second one is a network of people derived from the
"list of people" within Japanese Wikipedia, also used by Kimura et al. (2009), and has 9,481 nodes
and 245,044 directed links (the Wikipedia network). The third one is a network derived from the
Enron Email Dataset (Klimt & Yang, 2004) by extracting the senders and the recipients and linking
those that had bidirectional communications. It has 4,254 nodes and 44,314 directed links (the
Enron network). The fourth one is a coauthorship network used by Palla, Derényi, Farkas, and
Vicsek (2005) and has 12,357 nodes and 38,896 directed links (the coauthorship network). These
networks are confirmed to satisfy the typical characteristics of social networks, e.g., a power law
for the degree distribution and a high clustering coefficient.

In these experiments, to make a fair comparison, we set the value of the diffusion probability
(AsIC) and the value of the link weight (AsLT) under the simplest assumption such that they are
consistent in the following sense:
$\sum_{(u,v) \in E} p_{u,v} = \sum_{(u,v) \in E} q_{u,v} = |V|$. Thus, $p_{u,v} = 1/\bar{d}$ and
$q_{u,v} = 1/|B(v)|$ for any $(u, v) \in E$, where $\bar{d}$ is the average out-degree of the
network. The value of $p_{u,v}$ ($(u, v) \in E$) is therefore 0.15, 0.04, 0.1, and 0.32 for the
Blog, the Wikipedia, the Enron, and the Coauthorship networks, respectively.
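The diffusion probabilities quoted above follow directly from the node and link counts reported for the four networks, since $1/\bar{d} = |V|/|E|$. The short sketch below reproduces this arithmetic; the dictionary and function names are illustrative.

```python
# Node and directed-link counts as reported for the four networks.
networks = {
    "Blog": (12047, 79920),
    "Wikipedia": (9481, 245044),
    "Enron": (4254, 44314),
    "Coauthorship": (12357, 38896),
}

def uniform_diffusion_probability(num_nodes: int, num_links: int) -> float:
    """p = 1 / (average out-degree) = |V| / |E|."""
    return num_nodes / num_links

for name, (n, m) in networks.items():
    p = uniform_diffusion_probability(n, m)
    # Summing p over all links recovers |V|, the consistency condition in the text.
    print(f"{name}: p = {p:.2f}, sum over links = {p * m:.0f}, |V| = {n}")
```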

We compare the influence degree obtained by the AsIC and the AsLT models from various angles.
Here, the influence degree $\sigma(v)$ of a node $v$ is defined to be the expected number of active
nodes at the end of the information diffusion process that starts from a single initial active node
$v$. Since the time-delay parameter vector $\mathbf{r}$ does not affect the influence degree
(because it is defined at the end of the diffusion process), that is, $\sigma(v)$ is invariant with
respect to the value of $\mathbf{r}$, we can evaluate the value of $\sigma(v)$ by the influence
degree of the corresponding basic IC or LT model. We estimated the influence degree by the bond
percolation based method (Kimura et al., 2010), in which we used 300,000 bond percolation processes
according to Kempe et al. (2003), meaning that the expectation is approximated by the empirical mean
of 300,000 independent simulations.

Figure 1: Comparison of influence degree between the AsIC and the AsLT models (panels include
(c) Enron network and (d) Coauthorship network; horizontal axis: influence degree)
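To illustrate what "the empirical mean of independent simulations" means for the influence degree, the following minimal sketch estimates $\sigma(v)$ for the basic IC model by plain Monte Carlo simulation. This is not the bond percolation method of Kimura et al. used in the experiments; it is a simpler stand-in, and the graph representation (a dictionary of child lists), the uniform probability `p`, and all function names are assumptions made for illustration.

```python
import random
from collections import deque
from typing import Dict, List

def simulate_ic(graph: Dict[int, List[int]], seed_node: int, p: float,
                rng: random.Random) -> int:
    """One run of the basic IC model; returns the number of active nodes at the end."""
    active = {seed_node}
    queue = deque([seed_node])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            # A newly active node gets a single chance to activate each inactive child.
            if v not in active and rng.random() < p:
                active.add(v)
                queue.append(v)
    return len(active)

def influence_degree(graph: Dict[int, List[int]], seed_node: int, p: float,
                     num_runs: int = 1000, seed: int = 0) -> float:
    """Empirical mean of the final number of active nodes over independent runs."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seed_node, p, rng) for _ in range(num_runs)) / num_runs

toy_graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(influence_degree(toy_graph, seed_node=0, p=0.5))
```

Because the delays do not change which nodes eventually become active, the same estimate applies to the AsIC model with any value of the time-delay parameter.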

4.2 Experimental Results 

First, we investigated which of the AsIC and AsLT models can spread information more widely.
Figure 1 shows the cumulative probability of influence degree,
$f_\sigma(x) = |\{v \in V;\ \sigma(v) > x\}|/|V|$, for the AsIC and the AsLT models. At a glance we
can see that the AsIC model has far more nodes of high influence degree than the AsLT model.
Further, we examined the difference in influence degree between the two models for the respective
influential nodes of both the AsIC and the AsLT models. We ranked nodes according to the influence
degree of AsIC and AsLT, respectively, and extracted the top 200 influential nodes for each.
Figures 2 and 3 display the respective influence degree of the rank $k$ node of AsIC and AsLT
($k = 1, \ldots, 200$). Here, the red line indicates the influence degree of AsIC, and the blue line
indicates the influence degree of AsLT. We can see that the difference in influence degree between
the two models is quite large for these influential nodes. This clearly indicates that information
can diffuse more widely under the AsIC model than under the AsLT model. This can be attributed to
the scale-free nature (having power-law degree distributions) of the four real networks used in the
experiments. It is known (Albert, Jeong, & Barabási, 2000) that hub nodes, defined as those having
many outgoing links, play an important role in spreading information widely in a scale-free network.
By the information diffusion mechanisms of the AsIC and AsLT models, it is more difficult for the
AsLT model to transmit information to hub nodes than for the AsIC model in a scale-free network:
in these bidirectional networks a hub also has many parents, so under the setting
$q_{u,v} = 1/|B(v)|$ each active parent contributes only a small weight toward the hub's threshold,
whereas under AsIC each active parent has an independent chance of success. Therefore, the result is
understandable.

Figure 2: Influence degree of AsIC and AsLT for the influential nodes of the AsIC model (panels
include (c) Enron network and (d) Coauthorship network)

Next, we compared the influential nodes of the AsIC and the AsLT models. The results are shown in
Figures 4 and 5. In both figures the horizontal axis is the node ranking ($k = 1, \ldots, 200$),
and the actual ranking depends on which model we are considering, e.g., the rank $k$ node for AsIC
is different from the rank $k$ node for AsLT. The vertical axis is the influence degree in both
figures, but it is the influence degree for AsIC in Figure 4 and that for AsLT in Figure 5. The red
line corresponds to the nodes ranked for AsIC and the blue line corresponds to the nodes ranked for
AsLT. Thus, by definition of the node ranking, the influence degree of AsIC (red thick line) is
non-increasing in Figure 4 and the influence degree of AsLT (blue thick line) is non-increasing in
Figure 5. However, the corresponding line for AsLT (blue line) in Figure 4 and that for AsIC (red
line) in Figure 5 are very irregular. This means that almost all the nodes that are influential for
the AsIC model are different from the nodes that are influential for AsLT, and vice versa. There is
a small number of influential nodes that overlap for both models, but how similar the influential
nodes are (the degree of overlap) depends on the characteristics of the network structure, and no
general tendency can be extracted.

Figure 3: Influence degree of AsIC and AsLT for the influential nodes of the AsLT model (panels:
(a) Blog network, (b) Wikipedia network, (c) Enron network, (d) Coauthorship network; horizontal
axis: rank, 1 to 200; vertical axis: influence degree, logarithmic scale; curves: AsIC and AsLT)



5. Learning Performance Evaluation 
5.1 Data Sets and Parameter Setting 

We used the same four datasets as in Section 4, and again employed the simplest approximation for
the parameter setting, but with a slight difference, following the work of Saito et al. (2009).

We set $p_{u,v} = p$, $r_{u,v} = r$ for AsIC and $q_{u,v} = q\,|B(v)|^{-1}$, $r_{u,v} = r$ for
AsLT. Under this assumption there is no need for the observation sequence data to pass through
every link or node at least once, and desirably several times. This drastically reduces the amount
of data we have to generate as training data to learn the parameters. Our task is then to estimate
the values of these parameters from the training data. According to the work of Kempe et al. (2003),
we set $p$ to a value slightly smaller than $1/\bar{d}$. Thus, the true value of $p$ was set to 0.2
for the coauthorship network, 0.1 for the blog and Enron networks, and 0.02 for the Wikipedia
network. The true value of $q$ was set to 0.9 for every network to achieve reasonably long diffusion
results, and the true value of $r$ was set to 1.0.⁶

Figure 4: Comparison of the influential nodes of AsIC and AsLT measured in the influence degree
of AsIC



Using these parameter values, we generated a diffusion sequence from a randomly selected
initial active node for each of the AsIC and the AsLT models in the four networks. We then
constructed a training dataset such that each diffusion sequence has at least 10 nodes. Parameter
updating is terminated when either the iteration number reaches its maximum (set to 100) or the
following condition is first satisfied:
$|r^{(s+1)} - r^{(s)}| + |p^{(s+1)} - p^{(s)}| < 10^{-6}$ for AsIC and
$|r^{(s+1)} - r^{(s)}| + |q^{(s+1)} - q^{(s)}| < 10^{-6}$ for AsLT, where the superscript $(s)$
indicates the value at the $s$-th iteration. In most cases, the above inequality is satisfied in
fewer than 100 iterations. The converged values are rather insensitive to the initial parameter
values, and we confirmed that the parameter updating algorithm stably converges to the correct
values, i.e., the values we assumed to be the true values.

Figure 5: Comparison of the influential nodes of AsIC and AsLT measured in the influence degree
of AsLT
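The stopping rule described above (at most 100 iterations, or a total parameter change below $10^{-6}$) can be sketched as follows. The `update` argument is a hypothetical placeholder for one step of the parameter update algorithm given in the Appendix; the toy update used in the example is a simple contraction chosen only to make the loop runnable and has nothing to do with the actual likelihood.

```python
from typing import Callable, Tuple

def run_until_converged(update: Callable[[float, float], Tuple[float, float]],
                        r0: float, p0: float,
                        max_iter: int = 100, tol: float = 1e-6):
    """Iterate `update` until |r_new - r| + |p_new - p| < tol or max_iter is reached."""
    r, p = r0, p0
    for s in range(1, max_iter + 1):
        r_new, p_new = update(r, p)
        converged = abs(r_new - r) + abs(p_new - p) < tol
        r, p = r_new, p_new
        if converged:
            return r, p, s
    return r, p, max_iter

# Toy update (an assumption for illustration): moves halfway toward (1.0, 0.1) each step.
def toy_update(r: float, p: float) -> Tuple[float, float]:
    return 0.5 * (r + 1.0), 0.5 * (p + 0.1)

r_hat, p_hat, iterations = run_until_converged(toy_update, r0=2.0, p0=0.5)
print(r_hat, p_hat, iterations)
```

For AsLT the same loop applies with $q$ in place of $p$.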

Table 1: Parameter estimation error of the learning method for the AsIC model in four networks

Network        Number of active nodes   $\mathcal{E}_r$   $\mathcal{E}_p$
Blog                     1,163              0.019             0.026
                         5,151              0.018             0.014
                        10,322              0.011             0.011
Wikipedia                1,275              0.060             0.032
                         5,386              0.013             0.009
                        10,543              0.006             0.007
Enron                    1,456              0.031             0.030
                         5,946              0.011             0.011
                        10,468              0.005             0.006
Coauthorship             1,203              0.028             0.022
                         5,193              0.009             0.007
                        10,132              0.006             0.006

Table 2: Parameter estimation error of the learning method for the AsLT model in four networks

Network        Number of active nodes   $\mathcal{E}_r$   $\mathcal{E}_q$
Blog                     1,023              0.020             0.020
                         5,018              0.012             0.020
                        10,037              0.012             0.020
Wikipedia                1,018              0.032             0.024
                         5,038              0.015             0.020
                        10,025              0.006             0.017
Enron                    1,017              0.023             0.014
                         5,054              0.013             0.011
                        10,024              0.007             0.010
Coauthorship             1,014              0.017             0.034
                         5,023              0.017             0.029
                        10,023              0.006             0.027



5.2 Parameter Estimation 

We generated the training set for each of the AsIC and the AsLT models as follows, to evaluate the
proposed learning methods as a function of the number of observed active nodes, i.e., the amount of
training data. First we specified the target number $K$ of active nodes we want to have, and
the training set is generated by adding sequences one by one until the total number of active nodes
reaches $K$, with each sequence starting from a randomly chosen initial active node and very short
sequences (those with fewer than 10 nodes) being skipped. In the experiments, we investigated the
cases of $K = 1{,}000$, $5{,}000$ and $10{,}000$. Let $r^*$, $p^*$ and $q^*$ denote the true values
of $r$, $p$ and $q$, respectively, and $\hat{r}$, $\hat{p}$ and $\hat{q}$ the estimated values of
$r$, $p$ and $q$, respectively. We define the parameter estimation errors $\mathcal{E}_r$,
$\mathcal{E}_p$ and $\mathcal{E}_q$ by

$$\mathcal{E}_r = \frac{|\hat{r} - r^*|}{r^*}, \qquad
  \mathcal{E}_p = \frac{|\hat{p} - p^*|}{p^*}, \qquad
  \mathcal{E}_q = \frac{|\hat{q} - q^*|}{q^*}.$$

Tables 1 and 2 show the parameter estimation errors of the proposed learning methods for the AsIC
model and the AsLT model, respectively, in the four networks as a function of the number of observed
active nodes. Here, the results are averaged over five independent experiments. As expected, the
error is progressively reduced as the number of active nodes becomes larger. The algorithm is
guaranteed to converge but is not guaranteed to reach the globally optimal solution. In most cases,
the number of iterations is less than 100. These results indicate that it converges to the correct
solution in practice for all the parameters and for all the networks, which demonstrates the
effectiveness of the proposed methods.
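The error measure is a relative error averaged over independent experiments, which the following minimal sketch makes explicit. The estimates in the example are made-up numbers used only to exercise the function; they are not results from the paper, and the function names are illustrative.

```python
from typing import Sequence

def relative_error(estimate: float, truth: float) -> float:
    """|estimate - truth| / truth, as in the definition of the estimation errors."""
    return abs(estimate - truth) / truth

def mean_relative_error(estimates: Sequence[float], truth: float) -> float:
    """Average of the relative error over independent experiments."""
    return sum(relative_error(e, truth) for e in estimates) / len(estimates)

# Hypothetical estimates of r from five runs, with true value r* = 1.0.
r_estimates = [1.02, 0.99, 1.01, 0.97, 1.00]
print(mean_relative_error(r_estimates, truth=1.0))
```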

Next, we investigated the performance of the proposed learning method when the training set
is a single diffusion sequence. Table 3 shows the results for the four networks, where the results
are averaged over 100 independent experiments. Compared with Tables 1 and 2, the errors become
larger. The average error of $p$ and $r$ for AsIC is 6% and 8%, and the average error of $q$ and $r$
for AsLT is 8% and 18%, respectively. The best results for AsIC are for the Enron network (2% for
$p$ and 3% for $r$), and the best results for AsLT are for the Wikipedia network (7% for $q$) and
the Enron network (14% for $r$). The worst results for AsIC are for the Coauthorship network (12%
for $p$ and 11% for $r$), and the worst results for AsLT are for the Coauthorship network (9% for
$q$ and 21% for $r$). In general the accuracy is better for AsIC than for AsLT. This is because the
sequences are longer for AsIC. Further, $r$ is more difficult to estimate correctly than $p$ and
$q$. In order to see the difference in the learning result for each sequence in more depth, we
plotted the number of active nodes as a function of time (the influence curve),⁷ and the values of
the parameters learned, $(p, r)$ for AsIC and $(q, r)$ for AsLT, in Figures 6 and 7. The length of
each sequence varies considerably; some sequences are short and others are long. The color of the
dots for the learned parameters goes from pure blue to pure red in proportion to the sequence
length, i.e., the shortest sequence is pure blue and the longest sequence is pure red. From these
results we can see that the algorithm learns the parameter values within 10% of the correct values
if the sequence is reasonably long. For example, the Enron network generates long sequences from all
the randomly chosen initial active nodes in the case of AsIC, and the learning accuracy is very
good. We conclude that, although it is not desirable, we can still estimate the parameter values
from a single observation sequence if this is the only choice available.

6. Note that a different value of $r$ corresponds to a different scaling of the time axis under the
assumption of uniform value.

7. This is different from the influence degree $\sigma$ described in Section 4.1, which is the
expected value of the number of active nodes at the final time.

Table 3: Parameter estimation error of the learning method from a single observed sequence for four
networks (Values in parentheses are standard deviations.)

Network                    Blog            Wikipedia       Enron           Coauthorship
AsIC   $\mathcal{E}_r$     0.091 (0.121)   0.088 (0.132)   0.029 (0.020)   0.119 (0.173)
       $\mathcal{E}_p$     0.064 (0.085)   0.043 (0.056)   0.022 (0.019)   0.121 (0.255)
AsLT   $\mathcal{E}_r$     0.188 (0.219)   0.192 (0.272)   0.143 (0.140)   0.214 (0.194)
       $\mathcal{E}_q$     0.078 (0.049)   0.069 (0.043)   0.077 (0.053)   0.086 (0.054)

Figure 6: Influence curve and the learned parameter values from a single observed sequence in case
of AsIC (There are 100 sequences and 100 points in each figure; panels include (c) Enron network
and (d) Coauthorship network.)

Figure 7: Influence curve and the learned parameter values from a single observed sequence in case
of AsLT (There are 100 sequences and 100 points in each figure; panels include (c) Enron network
and (d) Coauthorship network.)

5.3 Node Ranking 

We measure the influence of node $v$ by the influence degree $\sigma(v)$ for the diffusion model
that has generated $\mathcal{D}_M$. We compared the highly ranked influential nodes under the true
model, which uses the assumed true parameter values, with those obtained by 1) the proposed method,
which uses the learned parameter values, 2) four heuristics widely used in social network analysis
(all computed from the network topology alone), and 3) the same proposed method in which an
incorrect diffusion model is assumed, i.e., the data are generated by AsIC but learning assumes
AsLT, and vice versa. Here again the influence degree is estimated by the bond percolation method
(Kimura, Saito, & Nakano, 2007; Kimura et al., 2010), where we used 10,000 bond percolation
processes according to Kimura et al. (2009) and Kimura et al. (2010).

We call the proposed method the model based method. We call it the AsIC model based method
if it employs the AsIC model as the information diffusion model: we learn the parameters of the
AsIC model from the observed data $\mathcal{D}_M$, and rank nodes according to the influence degrees
based on the learned model. The AsLT model based method is defined in the same way. Among the four
heuristics we used, the first three are "degree centrality", "closeness centrality", and
"betweenness centrality". These are commonly used as influence measures in sociology (Wasserman &
Faust, 1994), where the out-degree of node $v$ is defined as the number of links going out from $v$,
the closeness of node $v$ is defined as the reciprocal of the average distance between $v$ and other
nodes in the network, and the betweenness of node $v$ is defined as the total number of shortest
paths between pairs of nodes that pass through $v$. The fourth is "authoritativeness" obtained by
the "PageRank" method (Brin & Page, 1998). We considered this measure as one alternative since it is
a well known method for identifying authoritative or influential pages in a hyperlink network of web
pages. This method has a parameter $\varepsilon$; when we view it as a model of a random web surfer,
$\varepsilon$ corresponds to the probability with which a surfer jumps to a page picked uniformly at
random (Ng, Zheng, & Jordan, 2001). In our experiments, we used a typical setting of
$\varepsilon = 0.15$.

Figure 8: Performance comparison in extracting influential nodes for the AsIC model

In terms of extracting influential nodes from the network $G = (V, E)$, we evaluated the
performance of the ranking methods mentioned above by the ranking similarity
$F(k) = |L^*(k) \cap L(k)|/k$ within the rank $k$ ($> 0$), where $L^*(k)$ and $L(k)$ are the true
set of top $k$ nodes and the set of top $k$ nodes for a given ranking method, respectively. We
focused on the performance for high ranked nodes since we are interested in extracting influential
nodes. Figures 8 and 9 show the results for the AsIC and the AsLT models, respectively. For the
diffusion model based methods, we plotted the average value of $F(k)$ at $k$ over five independent
experimental results. We see that the proposed method gives better results than the other methods
for these networks, demonstrating the effectiveness of our proposed learning method. It is
interesting to note that the model based method in which an incorrect diffusion model is used is as
bad as, and in general worse than, the heuristic methods. The results imply that it is important to
consider the information diffusion process explicitly in discussing influential nodes and also to
identify the correct model of information diffusion for the task at hand, which is the same
observation as in Section 4.
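The ranking similarity $F(k)$ is simply the fraction of the true top-$k$ nodes recovered by a ranking method. A minimal sketch follows; the node labels in the example are arbitrary and the function name is illustrative.

```python
from typing import Hashable, Sequence

def ranking_similarity(true_ranking: Sequence[Hashable],
                       method_ranking: Sequence[Hashable], k: int) -> float:
    """F(k) = |L*(k) & L(k)| / k, where both rankings list nodes from most influential down."""
    if k <= 0:
        raise ValueError("k must be positive")
    true_top = set(true_ranking[:k])
    method_top = set(method_ranking[:k])
    return len(true_top & method_top) / k

true_ranking = ["a", "b", "c", "d"]
method_ranking = ["b", "a", "d", "c"]
for k in range(1, 5):
    print(k, ranking_similarity(true_ranking, method_ranking, k))
```

Note that $F(k)$ ignores the order within the top $k$, so two rankings that contain the same top-$k$ nodes in a different order still give $F(k) = 1$.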
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Figure 9: Performance comparison in extracting influential nodes for the AsLT model 



6. Model Selection 

Now we have a method to estimate the parameter values from the observation for each of the
assumed models. In this section we discuss whether the proposed learning method can correctly
identify which of the two models, AsIC or AsLT, the observed data come from, i.e., the model
selection problem. We assume that the topic is the decisive factor in determining the parameter
values and place a constraint that the parameters depend only on topics but not on nodes and links
of the network $G$, and we differentiate topics by assigning an index $l$ to each topic.

Therefore, we set $r_{l,u,v} = r_l$ and $p_{l,u,v} = p_l$ for any link $(u, v) \in E$ in the case of
the AsIC model, and $r_{l,u,v} = r_l$ and $q_{l,u,v} = q_l\,|B(v)|^{-1}$ for any node $v$ and link
$(u, v) \in E$ in the case of the AsLT model. Note that $0 < q_l < 1$ and $q_{l,v,v} = 1 - q_l$.
Since we normally have a very small number of observations for each $(l, u, v)$, often only one,
there is no way to learn the parameters without this constraint.

6.1 Model Selection based on Predictive Accuracy 

We have to select the model that is more appropriate for the observed diffusion sequence. We
decided to use predictive accuracy as the criterion for selection. We cannot use an
information theoretic criterion such as AIC (Akaike Information Criterion) (Akaike, 1978) or MDL
(Minimum Description Length) (Rissanen, 1978) because we need to select one from models
with completely different probability distributions. Moreover, for both models, it is quite
difficult to efficiently calculate the exact activation probability of each node for more than two
information diffusion cascading steps ahead. In order to avoid these difficulties, we propose a
method based on a hold-out strategy, which attempts to predict the activation probabilities one step
ahead and repeats this multiple times.

We now group the observed data sequences $\mathcal{D}_M$ into topics. Assume that each topic $l$
has $M_l$ sequences of observation, i.e., $\mathcal{D}_l = \{D_{l,m};\ m = 1, \ldots, M_l\}$, where
each $D_{l,m}$ is a set of pairs of an active node and its activation time in the $m$-th diffusion
result in the $l$-th topic. Accordingly we add a subscript $l$ to other variables, e.g., we write
$t_{l,m,v}$ for the time at which a node $v$ is activated in the $m$-th sequence of the $l$-th
topic.

We learn the model parameters for each topic separately. This does not exclude treating each
sequence in a topic separately and learning from each, i.e., $M_l = 1$, which would help in
investigating whether sequences of the same topic propagate similarly or not. For simplicity, we
assume that for each $D_{l,m}$, the initial observation time $t_{l,m}$ is zero, i.e., $t_{l,m} = 0$
for $m = 1, \ldots, M_l$. Then, we introduce a set of observation periods

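$$\mathcal{I}_l = \{[0, \tau_{l,n});\ n = 1, \ldots, N_l\},$$

where $N_l$ is the number of observation data we want to predict sequentially and each
$\tau_{l,n}$ has the following property: there exists some $(v, t_{l,m,v}) \in D_{l,m}$ such that
$0 < \tau_{l,n} < t_{l,m,v}$. Let $D_{l,m;\tau_{l,n}}$ denote the observation data in the period
$[0, \tau_{l,n})$ for the $m$-th diffusion result in the $l$-th topic, i.e.,

$$D_{l,m;\tau_{l,n}} = \{(v, t_{l,m,v}) \in D_{l,m};\ t_{l,m,v} < \tau_{l,n}\}.$$

We also set $\mathcal{D}_{M_l;\tau_{l,n}} = \{(D_{l,m;\tau_{l,n}}, \tau_{l,n});\ m = 1, \ldots, M_l\}$.
Let $\Theta$ denote the set of parameters for either the AsIC or the AsLT models, i.e.,
$\Theta = (\mathbf{r}, \mathbf{p})$ or $\Theta = (\mathbf{r}, \mathbf{q})$. We can estimate the
values of $\Theta$ from the observation data $\mathcal{D}_{M_l;\tau_{l,n}}$ by using the learning
algorithms in Sections 3.1 (Appendix A) and 3.2 (Appendix B). Let $\hat{\Theta}_{\tau_{l,n}}$ denote
the estimated values of $\Theta$. Then, we can calculate the activation probability
$q_{\tau_{l,n}}(v, t)$ of node $v$ at time $t$ ($\ge \tau_{l,n}$) using $\hat{\Theta}_{\tau_{l,n}}$.

For each $\tau_{l,n}$, we select the node $v(\tau_{l,n})$ and the time
$t_{l,m(\tau_{l,n}),v(\tau_{l,n})}$ by

$$t_{l,m(\tau_{l,n}),v(\tau_{l,n})} = \min\Big\{ t_{l,m,v};\ (v, t_{l,m,v}) \in
  \bigcup_{m=1}^{M_l} \big(D_{l,m} \setminus D_{l,m;\tau_{l,n}}\big) \Big\}.$$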
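Note that $v(\tau_{l,n})$ is the first active node in $t \ge \tau_{l,n}$. We evaluate the predictive
performance for the node $v(\tau_{l,n})$ at time $t_{l,m(\tau_{l,n}),v(\tau_{l,n})}$. Approximating
the empirical distribution by

$$p_{\tau_{l,n}}(v, t) = \delta_{v, v(\tau_{l,n})}\, \delta\big(t - t_{l,m(\tau_{l,n}),v(\tau_{l,n})}\big)$$

with respect to $(v(\tau_{l,n}), t_{l,m(\tau_{l,n}),v(\tau_{l,n})})$, we employ the Kullback-Leibler
(KL) divergence

$$KL(p_{\tau_{l,n}} \,\|\, q_{\tau_{l,n}}) = \sum_{v \in V} \int_{\tau_{l,n}}^{\infty}
  p_{\tau_{l,n}}(v, t) \log \frac{p_{\tau_{l,n}}(v, t)}{q_{\tau_{l,n}}(v, t)}\, dt,$$

where $\delta_{v,w}$ and $\delta(t)$ stand for Kronecker's delta and Dirac's delta function,
respectively. Then, we can easily show

$$KL(p_{\tau_{l,n}} \,\|\, q_{\tau_{l,n}}) = -\log h_{m(\tau_{l,n}),v(\tau_{l,n})}. \qquad (13)$$

Table 4: Accuracy of the model selection method for four networks

Network   Blog         Wikipedia     Enron          Coauthorship
AsIC      92 (370.2)   100 (920.8)   100 (1500.6)   93 (383.5)
AsLT      79 (28.2)    86 (54.0)     99 (47.7)      76 (19.0)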

By averaging the above KL divergence with respect to $\mathcal{I}_l$, we propose the following model selection
criterion $\mathcal{E}$ (see Equation (13)):

$$\mathcal{E}(A; D_{l,1} \cup \cdots \cup D_{l,M_l}) = -\frac{1}{N_l} \sum_{n=1}^{N_l} \log h_{m(\tau_{l,n}),v(\tau_{l,n})}, \qquad (14)$$

where $A$ expresses the information diffusion model (i.e., the AsIC or the AsLT model). In our
experiments, we adopted 

where $t_0$ is the median time of all the observed activation time points.
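
As a minimal sketch of how the criterion in Equation (14) might be computed and used, the following Python snippet averages the negative log predictive densities and picks the model with the smaller value. The function names (`average_kl`, `select_model`) and the numeric densities are hypothetical and serve only as illustration; in practice each $h$ value would come from the model re-estimated on the data observed before the corresponding $\tau_{l,n}$.

```python
import math


def average_kl(h_values):
    """Average KL divergence as in Equation (14): -(1/N) * sum of log h."""
    if not h_values:
        raise ValueError("need at least one held-out prediction")
    return -sum(math.log(h) for h in h_values) / len(h_values)


def select_model(h_by_model):
    """Return the model with the smallest average KL divergence, plus all scores."""
    scores = {name: average_kl(h) for name, h in h_by_model.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical predictive densities at three held-out first activations.
h_by_model = {
    "AsIC": [0.20, 0.10, 0.05],
    "AsLT": [0.02, 0.01, 0.04],
}
best, scores = select_model(h_by_model)
```

A smaller score means that the model assigned a higher density to the activations that actually occurred next, which is the sense of "predictive accuracy" used in this section.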
6.2 Evaluation by Synthetic Data 

Our goal here is to evaluate how accurately the model selection method can detect the true
model that generated the data, using the topological structure of the four large real networks described in
Section 4.1. We assumed the true model by which the data are generated to be either AsLT or AsIC.
The parameters have to be estimated repeatedly using the proposed parameter update algorithms. In
actual computation, the values learned for the observation period $[0, \tau_{l,n})$ are used as the initial values
for the observation period $[0, \tau_{l,n+1})$ for efficiency.

The average KL divergence given by Equation (14) measures the goodness of the model
$A$ for a training set $D_l$ of $M_l$ sequences with respect to topic $l$. The smaller its value is, the better
the model explains the data in terms of predictability. Thus, we can estimate the true model from
which $D_l$ is generated to be AsIC if $\mathcal{E}(AsIC; D_l) < \mathcal{E}(AsLT; D_l)$, and vice versa. Using each
of the AsIC and the AsLT models as the true model, we generated a training set $D_l$. Here we set
$M_l = 1$, i.e., we generated a single diffusion sequence, learned a model and performed the model
selection. We repeated this 100 times independently for each of the four networks mentioned before. We
could have set $M_l = 100$ and learned a single parameter set. This would be more reliable, but we wanted
to know whether the model selection algorithm works well using only a single sequence of
data.

Table 4 summarizes the number of times that the model selection method correctly identified
the true model. The number in parentheses is the average length of the diffusion sequences
in the training set. From these results, we can say that the proposed method achieved a good
accuracy, 90.6% on average. In particular, for the Enron network, its estimation was almost perfect.
To analyze the performance of the proposed method more deeply, we investigated the relation
between the length of sequence and the model selection result. Figure 10 shows the results for the
case that $D_l$ is generated by the AsIC model. Here, the horizontal axis denotes the length of
sequence in each dataset and the vertical axis is the difference of the average KL divergence defined
by $J(AsIC; AsLT) = \mathcal{E}(AsLT; D_l) - \mathcal{E}(AsIC; D_l)$. Thus, $J(AsIC; AsLT) > 0$ means that the
proposed method correctly estimated the true model AsIC because it means that
$\mathcal{E}(AsIC; D_l)$ is smaller than $\mathcal{E}(AsLT; D_l)$. From the figure, we can see that there is a
correlation between the length of sequence and the estimation accuracy, and that misselection occurs
when the sequence is short. In particular, Wikipedia and Blog networks have no
misselection. Figure 11 shows the results for the case that $D_l$ is generated by the AsLT model. Here,
$J(AsLT; AsIC) = \mathcal{E}(AsIC; D_l) - \mathcal{E}(AsLT; D_l)$. We notice that the overall accuracy becomes
95.5% when considering only the sequences that contain no less than 20 nodes. This means that
the proposed model selection method is highly reliable for a long sequence and that its accuracy could
asymptotically approach 100% as the sequence gets longer. We can also see from Figures 10 and
11 that the results for the AsIC model are better than those for the AsLT model. We note that the
plots for the diffusion sequences generated from the AsIC model are shifted to the right in all
networks, meaning that the diffusion sequences are longer for AsIC than for AsLT. The better accuracy
is attributed to this.

Figure 10: Relation between the length of sequence and the accuracy of model selection for a single
diffusion sequence generated from the AsIC model (There are 100 points.)

Figure 11: Relation between the length of sequence and the accuracy of model selection for a single
diffusion sequence generated from the AsLT model (There are 100 points.)
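
The 90.6% average quoted above can be recomputed directly from Table 4. The snippet below is only an arithmetic check; the variable and function names are illustrative, and the counts are the numbers of correct identifications out of 100 trials as listed in the table.

```python
# Correct identifications out of 100 trials, copied from Table 4.
correct = {
    "AsIC": {"Blog": 92, "Wikipedia": 100, "Enron": 100, "Coauthorship": 93},
    "AsLT": {"Blog": 79, "Wikipedia": 86, "Enron": 99, "Coauthorship": 76},
}


def mean_accuracy(counts):
    """Mean percentage over all (true model, network) cells."""
    values = [c for per_net in counts.values() for c in per_net.values()]
    return sum(values) / len(values)


overall = mean_accuracy(correct)
# Per-model means show the gap between AsIC-generated and AsLT-generated data.
per_model = {m: sum(v.values()) / len(v) for m, v in correct.items()}
```

The per-model means (96.25 for AsIC and 85.0 for AsLT) make the gap discussed in the text explicit.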

6.3 Evaluation by Real World Blog Data 

We analyzed the behavior of topics in real-world blog data. Here, again, we assumed the true model
behind the data to be either the AsIC model or the AsLT model. Using each pair of the estimated
parameters, $(r_l, p_l)$ for AsIC and $(r_l, q_l)$ for AsLT, we first analyzed the behavior of people with
respect to the information topics by simply plotting each pair as a point in 2-dimensional space. We
next estimated the true model for each topic by applying the model selection method described in
Section 6.1.

6.3.1 Data Sets and Parameter Setting 

We employed the real blogroll network used by Saito et al. (2009), which was generated from the
database of a blog-hosting service in Japan called Doblog.^8 In the network, bloggers are connected
to each other and we assume that topics propagate from blogger $x$ to another blogger $y$ when there
is a blogroll link from $y$ to $x$. In addition, following the work of Adar and Adamic (2005), we
assume that a topic is represented as a URL which can be tracked down from blog to blog. We
used the propagation sequences of 172 URLs for this analysis, each of which has at least 10 time
steps. Some of these 172 URLs are the same, meaning that there are multiple sequences
for the same topic, i.e., $M_l > 1$. However, as in the analysis of Section 6.2, we treated them as
if $M_l = 1$ and used each sequence independently. The main reason for this is that we want to
investigate whether the same topic propagates in the same way when there are multiple sequences,
as well as to test whether the model selection is feasible from a single sequence in the case of
real data.
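
The link-direction convention and the sequence filter described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, sequences are assumed to be lists of `(blogger, time)` pairs, and "at least 10 time steps" is read here as at least 10 distinct activation times, which is an assumption about the original preprocessing.

```python
def diffusion_edges(blogroll_links):
    """Reverse blogroll links: a link from y to x lets topics flow from x to y."""
    return {(x, y) for (y, x) in blogroll_links}


def usable_sequences(sequences, min_steps=10):
    """Keep sequences with at least `min_steps` distinct activation times."""
    return [s for s in sequences if len({t for _, t in s}) >= min_steps]


# Hypothetical data: y and z both list x in their blogrolls.
links = {("y", "x"), ("z", "x")}
edges = diffusion_edges(links)

long_seq = [(f"blogger{i}", float(i)) for i in range(10)]
short_seq = [("a", 0.0), ("b", 1.0)]
kept = usable_sequences([long_seq, short_seq])
```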

6.3.2 Parameter Estimation 

We ran the experiments for each identified URL and obtained the parameters $p$ and $r$ for the AsIC
model based method and $q$ and $r$ for the AsLT model based method. Figures 12a and 12b are the
plots of the results for the major URLs (topics) by the AsIC and AsLT methods, respectively. The
horizontal axis is the diffusion parameter $p$ for the AsIC method and $q$ for the AsLT method, while
the vertical axis is the delay parameter $r$ for both. The latter axis is normalized such that $r = 1$
corresponds to a delay of one day, meaning $r = 0.1$ corresponds to a delay of 10 days. In these
figures, we used five kinds of markers other than dots to represent five different typical URLs: the
circle (○) stands for a URL that corresponds to the musical baton, which is a kind of telephone game
on the Internet (the musical baton),^9 the square (□) for a URL that corresponds to articles about a
missing child (the missing child), the cross (×) for a URL that corresponds to articles about fortune
telling (the fortune telling), the diamond (◇) for a URL of a certain charity site (the charity), and
the plus (+) for a URL of a site for a flirtatious tendency test (the flirtation). All the other topics are
denoted by dots (•), which means they are a mixture of many topics.

8. Doblog (http://www.doblog.com/), provided by NTT Data Corp. and Hotto Link, Inc.

Figure 12: Results for the Doblog database

The results indicate that in general both the AsIC and AsLT models capture the
characteristic properties of topics reasonably well and in a similar way. We note that the same topic behaves similarly
for different sequences except for the fortune telling. This supports the assumption we made in
Section 6.1. A careful look at the URLs used to identify the topic of fortune telling indicates that there are
multiple URLs involved, and treating them as a single topic may have been too crude an assumption.
Another interpretation is that people's perception of this topic is not uniform and varies considerably
from person to person, so that it should be viewed as an exception to the assumption. The behavior of the
other topics is interpretable. For example, the results capture the urgency of the missing child,
which propagates quickly with a meaningful probability (one out of 80 persons responds). The musical
baton, which actually became the latest craze on the Internet, also propagates quickly (less than one
day on average) with a good chance (one out of 25 to 100 persons responds). In contrast,
non-emergency topics such as the flirtation and the charity propagate very slowly. We further note that
the dependency of topics on the parameter $r$ is almost the same for both AsIC and AsLT, but that on
the parameters $p$ and $q$ is slightly different, e.g., the relative difference of musical baton, missing child
and charity. Although $p$ and $q$ are different parameters, both are measures that represent how
easily the diffusion takes place. As is shown in Section 5.3, the influential nodes are very sensitive
to the model used and this can be attributed to the differences of these parameter values.
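
The readings of the axes used above can be made concrete with a small conversion sketch. It assumes, as the normalization in this section suggests, that the time delay is exponentially distributed with rate $r$ per day, so that the mean delay is $1/r$ days, and it reads "one out of $N$ persons responds" as a diffusion probability of $1/N$. Both function names are hypothetical.

```python
def mean_delay_days(r):
    """Mean of an exponential delay with rate r (per day)."""
    if r <= 0:
        raise ValueError("r must be positive")
    return 1.0 / r


def one_in_n(p):
    """Express a diffusion probability p as 'one out of N persons responds'."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return 1.0 / p


delay_fast = mean_delay_days(1.0)   # r = 1 corresponds to one day
delay_slow = mean_delay_days(0.1)   # r = 0.1 corresponds to ten days
responders = one_in_n(1.0 / 80.0)   # the missing-child example in the text
```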



9. It has the following rules. First, a blogger is requested to respond to five questions about music by some other blogger 
(receive the baton) and the requested blogger replies to the questions and designates the next five bloggers with the 
same questions (pass the baton). 
Table 5: Results of model selection for the Doblog dataset

Topic             Total   AsLT   AsIC
Musical baton         9      5      4
Missing child         7      0      7
Fortune telling      28      4     24
Charity               6      5      1
Flirtation            7      7      0
Others              115     11    104

Figure 13: The relation between the KL difference and sequence length for the Doblog database



6.3.3 Results of Model Selection

In the analysis of the previous subsection, we assumed that all topics follow the same diffusion model.
However, in reality this is not true and each topic should propagate following more closely
one of the AsLT and AsIC models. We attempt to estimate the underlying behavior model of each
topic by applying the model selection method described in Section 6.1. As explained, we treat each
sequence independently, learn the parameters from each sequence, calculate its KL divergences
by Equation (14) for both models, and compare their goodness. Table 5 and Figure 13 summarize
the results. From these results, we can see that most of the diffusion behaviors on this blog network
follow the AsIC model. It is interesting to note that the model estimated for the musical baton is not
identical to that for the missing child although their diffusion patterns are very similar (see Section
6.3.2). The missing child strictly follows the AsIC model. This is attributed to its greater urgency.
People would post what they know if they think it is useful, without being influenced by the behavior
of their neighbors. For the musical baton, Table 5 indicates that the numbers are nearly tied (4 vs. 5),
but we saw in Section 6.2 that a longer sequence gives better accuracy, and the models selected
for the longer sequences are all AsLT in Figure 13 for the musical baton. Thus, we estimate that the musical
baton follows AsLT more closely. This can be interpreted as people following their friends in this
game. Likewise, it is easy to imagine that people would behave similarly to their neighbors when
requested to give a donation. This explains why the charity follows AsLT. The flirtation clearly follows
AsLT. People are tempted to do bad things when their neighbors do so. Note that there exists one
dot near the top center in Figure 13, showing the greatest tendency to follow AsLT. This dot
represents a typical circle site that distributes one's original news article on personal events.
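
The counts in Table 5 can be tallied to quantify the statement that most sequences follow the AsIC model. The sketch below is only bookkeeping over the table; the two zero entries are blank in the extracted table and are inferred here from the row totals, and the variable names are illustrative.

```python
# (AsLT, AsIC) selections per topic, transcribed from Table 5.
# The two zero entries are blank in the table and are inferred from the totals.
selections = {
    "Musical baton": (5, 4),
    "Missing child": (0, 7),
    "Fortune telling": (4, 24),
    "Charity": (5, 1),
    "Flirtation": (7, 0),
    "Others": (11, 104),
}

total_aslt = sum(a for a, _ in selections.values())
total_asic = sum(b for _, b in selections.values())
asic_share = total_asic / (total_aslt + total_asic)

# Majority model per topic by raw counts (this ignores sequence length).
majority = {t: ("AsLT" if a > b else "AsIC") for t, (a, b) in selections.items()}
```

A raw majority vote treats all sequences equally, whereas the discussion above gives more weight to longer sequences, which is why the musical baton is assigned to AsLT with more confidence than the 5 vs. 4 count alone would suggest.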

7. Discussion 

Myers and Leskovec (2010) have recently proposed a method in which the likelihood is described
in a somewhat generic way with respect to a given diffusion dataset for a wide class of IC-type
information diffusion models. Their purpose is to infer the latent network structure. On the other
hand, our interest is to explore the salient characteristics of two contrasting information diffusion
models assuming that the structure is known. Although their purpose is substantially different from
ours, we share with them the common idea of estimating parameters in information diffusion
models. However, there exist some mathematically notable differences. The main difference comes
from the derivation of the probability density $h_{m,v}$ that one or more active parent nodes of a node
$v$ succeed(s) in activating $v$ at time $t_{m,v}$ for the $m$-th diffusion sequence (see Equation (3)). In order
to clarify this point, we denote the corresponding formula used in Myers and Leskovec (2010) by
$h'_{m,v}$; then $h'_{m,v}$ is expressed as follows:

$$h'_{m,v} = 1 - \prod_{u \in C_m(t_{m,v})} \big(1 - w(t_{m,v} - t_{m,u}) A_{i,j}\big), \qquad (15)$$

where, according to their terminology, $w(t)$ and $A_{i,j}$ stand for the transmission time model and
the conditional probability of infection transmission, respectively. Here note that the product term
$w(t_{m,v} - t_{m,u}) A_{i,j}$ is equivalent to our formula $\mathcal{X}_{m,u,v}$, where $\mathcal{X}_{m,u,v}$ is defined as the probability
density that a node $u$ activates the node $v$ at time $t_{m,v}$ (see Equation (1)).

For an active parent node $u$, the term $(1 - w(t_{m,v} - t_{m,u}) A_{i,j})$ appearing in Equation (15)
conceptually corresponds to our formula $\mathcal{Y}_{m,u,v}$, where $\mathcal{Y}_{m,u,v}$ is defined as the probability that the
node $v$ is not activated by the node $u$ within the time-period $[t_{m,u}, t_{m,v})$ (see Equation (2)). Here
note that from the observed $m$-th diffusion sequence, we know for sure that the node $u$ could not
succeed in activating $v$ during the time interval $t \in [t_{m,u}, t_{m,v})$. Namely, our formulation reflects
this observation explicitly in probability estimation, rather than just subtracting the probability from
1, as in the expression $(1 - w(t_{m,v} - t_{m,u}) A_{i,j})$. Furthermore, we can transform Equation (2) as
follows:

$$\mathcal{Y}_{m,u,v} = (1 - p_{u,v}) + \int_{t_{m,v}}^{\infty} p_{u,v}\, r_{u,v} \exp\big(-r_{u,v}(t - t_{m,u})\big)\, dt. \qquad (16)$$

Here we can naturally interpret this formula as follows: the first term on the right-hand side is the
probability that the node $u$ fails to activate $v$, and the second term corresponds to the probability
that the node $u$ succeeds in activating $v$ after $t_{m,v}$, i.e., the fact that the node $v$ is not activated
by the node $u$ within the time-period $[t_{m,u}, t_{m,v})$ means that $u$ has either failed to activate $v$ at all
or succeeded in activating $v$ but at an activation time outside the observed time-period. The basic
interpretation of the formula of Myers and Leskovec (2010) in Equation (15) is that at least one active
parent node activates $v$ at time $t_{m,v}$. Namely, their
formulation allows that $v$ is activated simultaneously by multiple parent nodes exactly at time
$t_{m,v}$, while our formulation does not consider this possibility. When the diffusion process unfolds in
continuous time $t$, the probability measure of such simultaneous activation is zero. Thus, we employ
our $h_{m,v}$ formulation as described in Equation (3). Of course, in the case of discrete-time modeling,
the situation of simultaneous activation by multiple active parents must be considered adequately.
The objective function for this case under the discrete-time IC model has been derived in Kimura,
Saito, Ohara, and Motoda (2011). The major advantage of the method of Myers and Leskovec (2010) is that it guarantees a
unique optimal solution, whereas ours only guarantees convergence to a stationary solution
which is not necessarily a global maximum. However, it is not clear that a similar approach can be
applied to Linear Threshold type diffusion models. In addition, as discussed above and also shown
in Section 3.3, we need to elaborate on the formula for $h_{m,v}$ in order to model the information
diffusion process more accurately, reflecting the subtle notion of different time delay models and as
much information from the observed data as possible. It is also not clear that the above advantage of their
formulation still holds when the formula for $h_{m,v}$ is modified accordingly. Our view is that their
formulation can be a useful technique for inferring latent network structure, but it has limitations if
we use it to explore the salient characteristics of different diffusion models. In this sense, we believe
that our approach based on the EM-like learning algorithm remains vital and useful for a wide class
of information diffusion models.
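
Returning to Equation (16), the equivalence between "one minus the probability of activation within the observed period" and "failure, or success after the observed period" can be checked numerically. The sketch below assumes the exponential delay density $r \exp(-r s)$ that appears in Equation (16), and assumes that Equation (2) has the form $1 - p \int_0^{\Delta} r \exp(-r s)\, ds$ with $\Delta = t_{m,v} - t_{m,u}$; the function names and parameter values are hypothetical.

```python
import math


def y_complement(p, r, dt):
    """1 - p * P(delay < dt): u does not activate v within dt."""
    return 1.0 - p * (1.0 - math.exp(-r * dt))


def y_decomposed(p, r, dt, horizon=100.0, steps=100_000):
    """(1 - p) plus the integral of p*r*exp(-r*s) over s in [dt, horizon].

    The infinite upper limit is truncated at `horizon`, and the integral is
    approximated with the midpoint rule.
    """
    width = (horizon - dt) / steps
    tail = sum(
        p * r * math.exp(-r * (dt + (i + 0.5) * width)) for i in range(steps)
    ) * width
    return (1.0 - p) + tail


p, r, dt = 0.3, 0.5, 2.0  # illustrative values only
lhs = y_complement(p, r, dt)
rhs = y_decomposed(p, r, dt)
```

Both expressions agree up to the numerical integration error, which illustrates why the second term of Equation (16) can be read as the probability of a successful activation that falls outside the observed period.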

We started with a general description of the parameter values but had to introduce drastic
simplifications in the experimental evaluations, both for synthetic datasets and for real world datasets. The results
in Section 6.3.2 imply that the assumption of topics being a decisive factor for diffusion parameter
values is plausible, which in turn justifies the use of the same parameter values for multiple
sequences of observation data if they concern the same topic. However, as one counterexample
is observed (fortune telling), this is definitely not true in general. Finding a small number of factors,
e.g., important node attributes, from which the parameter values can be estimated with good accuracy
is a crucial problem. Learning such a dependency is easy, as exemplified in Saito et al. (2011), once
such factors are identified and the real world data for such factors are available as part of the observed
information diffusion data.

As we explained in Section 5.3, the ranking results that involve detailed probabilistic simulation
are very sensitive to the underlying model which is assumed to generate the observed data. In other
words, it is very important to select an appropriate model for the analysis of information diffusion
from which the data has been generated if the node characteristics are the main objective of analysis,
e.g., in such problems as the influence maximization problem (Kempe et al., 2003; Kimura et al.,
2010), a problem at a more detailed level. However, it is also true that the parameters for the topics
that actually propagated quickly/slowly in observation converged to the values that enable them to
propagate quickly/slowly on the model, regardless of the model chosen. Namely, we can say that the
difference of models does not have much influence on the relative difference of topic propagation,
which indeed strongly depends on the topic itself. Both models are well defined and can explain this
property at this level of abstraction. Nevertheless, the model selection is very important if we want
to characterize how each topic propagates through the network.

One of the objectives of this paper is to understand the behavioral difference between the AsIC
model and the AsLT model. The analysis in Section 4.2 is based on the network structures taken
from real world data. We feel that a more mathematically oriented treatment is needed to qualitatively
understand the behavioral difference of these two models for a wide class of graphs from various
perspectives, e.g., types of graphs (regular vs. random) and graphs with different characteristics
(power-law, small-worldness, community structure, etc.).

There are other studies that deal with topic dependent information diffusion. A recent study by
Romero et al. (2011) discusses differences in the diffusion mechanism across different topics. They
experimentally obtain from the observation data the probability $p(k)$ that a node gets activated after
its active parents failed to activate it $k - 1$ times in succession, and model the diffusion process
using $p(k)$ under the SIR (Susceptible/Infectious/Recovered) setting. Their finding is that the shape of
$p(k)$ differs considerably from one topic to another, which is characterized by two factors, stickiness
(maximum value of $p(k)$) and persistence (rate of decay of $p(k)$ after the peak), and that repeated
exposures to a topic are particularly crucial when it is in some way controversial or contentious.
Another recent study on Twitter by Bakshy et al. (2011) attempts to quantify a node's influence
degree (the number of nodes that a seed node (initial node) can activate) by learning a regression tree
using various node attributes such as the number of followers, the number of friends, the number of tweets, past influence
degree and content related features. To their surprise, none of the content related attributes are
selected in the learned regression tree. They attribute this to the fact that most explanations of success
tend to focus only on observed success, which invariably represents a small and biased sample of
the total population. They conclude that individual level predictions of influence are unreliable, and
that it is important to rely on average performance. Both studies approach a similar problem from
different angles. There are many factors that need to be considered and much more work is needed to
understand this problem.

8. Conclusion 

We dealt with the problem of analyzing the information diffusion process in a social network using
probabilistic information diffusion models. There are two contrasting fundamental models that have
been widely used: the Independent Cascade model and the Linear Threshold model. These
are modeled based on two different ends of the spectrum. The IC model is sender-centered (an
information push style model) where the information sender tries to push information to its neighbors,
whereas the LT model is receiver-centered (an information pull style model) where the information
receiver tries to pull information. We extended these two contrasting models (called AsIC and AsLT)
by incorporating asynchronous time delay to make them realistic, enabling effective use of observed
information diffusion data. Using these as the basic tools, we addressed the following three
problems: 1) to clarify how these two contrasting models differ from or are similar to each other in terms
of information diffusion, 2) to devise effective algorithms to learn the model itself from the
observed information diffusion data, and 3) to identify which model is more appropriate to explain how
a particular topic (information) diffuses/propagates.

We first showed that there can be variations to each of these two models depending on how
we treat time delay. We identified two kinds of time delay: link delay and node delay,
where the latter is further divided into two categories: override and non-override. We derived the
likelihood function, the probability density of generating the observed data, for each model. Choosing
one particular time delay model, we showed that the model parameters are learnable from a limited
amount of observation by deriving, for both AsIC and AsLT, a parameter update algorithm that
maximizes the likelihood function, is guaranteed to converge and performs stably. We also
proposed a method to select a model that better explains the observation based on its predictive
accuracy. To this end, we devised a variant of the hold-out training algorithm applicable to a set of
sequential data and a method to select a better model by comparing the predictive accuracy using
the KL divergence.

Extensive evaluations were performed using both synthetic data and real data. We first showed,
using synthetic data with the network structures taken from four real networks, that there are
considerable behavioral differences between the AsIC and the AsLT models, and gave a qualitative account
of why such differences arise. We then experimentally confirmed that the proposed parameter
update algorithm converges to the correct values very stably and efficiently, that it can learn the
parameter values even from a single observation sequence if its length is reasonably long, that it can estimate
the influential nodes quite accurately whereas the frequently used centrality heuristics perform very
poorly, that the influential nodes are very sensitive to the model used, and that the proposed model selection
method can correctly identify the diffusion model by which the observed data is generated. We
further applied the methods to the real blog data and analyzed the behavior of topic propagation.
The relative propagation speed of topics, i.e., how far/near and how fast/slow each topic propagates,
which is derived from the learned parameter values, is rather insensitive to the model selected, but the
model selection algorithm clearly identifies the difference of model goodness for each topic. We
found that many of the topics follow the AsIC model in general, but some specific topics have clear
interpretations for being better modeled by one of the two, and these interpretations are
consistent with the model selection results. There are numerous factors that affect the information
diffusion process, and there can be a number of different models. Understanding the behavioral
difference of each model, learning these models efficiently from the available data and selecting the
correct model are a big challenge in social network analysis, and this work is a first step towards
this goal.
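Appendix A. Learning Algorithm for AsIC model

Maximizing $\mathcal{L}(r, p; \mathcal{D}_M)$ is equivalent to maximizing its logarithm. Let $\bar{r} = (\bar{r}_{u,v})$ and $\bar{p} =
(\bar{p}_{u,v})$ be the current estimates of $r$ and $p$, respectively. Taking the log of $h_{m,v}$ involves the log of a sum
of $\mathcal{X}_{m,u,v} (\mathcal{Y}_{m,u,v})^{-1}$, which is problematic. To get around this problem, we define $\alpha_{m,u,v}$ for each
$(v, t_{m,v}) \in D_m$ and $u \in B_{m,v}$ by

$$\alpha_{m,u,v} = \frac{\mathcal{X}_{m,u,v} (\mathcal{Y}_{m,u,v})^{-1}}{\sum_{z \in B_{m,v}} \mathcal{X}_{m,z,v} (\mathcal{Y}_{m,z,v})^{-1}}.$$

Let $\bar{\mathcal{X}}_{m,u,v}$, $\bar{\mathcal{Y}}_{m,u,v}$, $\bar{h}_{m,v}$ and $\bar{\alpha}_{m,u,v}$ denote the values of $\mathcal{X}_{m,u,v}$, $\mathcal{Y}_{m,u,v}$, $h_{m,v}$ and $\alpha_{m,u,v}$
calculated by using $\bar{r}$ and $\bar{p}$, respectively.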

From Equations (3), (5) and (7), we can transform our objective function $\mathcal{L}(r, p; \mathcal{D}_M)$ as
follows:

$$\log \mathcal{L}(r, p; \mathcal{D}_M) = Q(r, p; \bar{r}, \bar{p}) - H(r, p; \bar{r}, \bar{p}), \qquad (17)$$

where $Q(r, p; \bar{r}, \bar{p})$ is defined by

and $H(r, p; \bar{r}, \bar{p})$ is defined by

$$H(r, p; \bar{r}, \bar{p}) = \sum_{m=1}^{M} \sum_{v \in C_m} \sum_{u \in B_{m,v}} \bar{\alpha}_{m,u,v} \log \alpha_{m,u,v}. \qquad (18)$$


Since $H(r, p; \bar{r}, \bar{p})$ is maximized at $r = \bar{r}$ and $p = \bar{p}$ from Equation (18),^10 we can increase
the value of $\mathcal{L}(r, p; \mathcal{D}_M)$ by maximizing $Q(r, p; \bar{r}, \bar{p})$ (see Equation (17)). Note here that $Q$ is a
convex function with respect to $r$ and $p$, and thus the convergence is guaranteed. Here again we
have a problem of the log of a sum for $\log \mathcal{Y}_{m,u,v}$. In order to cope with this problem, we transform
$\log \mathcal{Y}_{m,u,v}$ in the same way as we introduced $\alpha_{m,u,v}$, and define $\beta_{m,u,v}$ by



Finally, we obtain the following update formulas of our estimation method as the solution which
maximizes $Q(r, p; \bar{r}, \bar{p})$:



where $\mathcal{M}^{+}_{u,v}$ and $\mathcal{M}^{-}_{u,v}$ are defined by

$$\mathcal{M}^{+}_{u,v} = \{m \in \{1, \cdots, M\};\ v \in C_m,\ u \in B_{m,v}\},$$
$$\mathcal{M}^{-}_{u,v} = \{m \in \{1, \cdots, M\};\ u \in C_m,\ v \in \partial C_m\}.$$

Note that we can regard our estimation method as a variant of the EM algorithm. We want to
emphasize here that the value of the likelihood function never decreases as the iteration proceeds
and that the iterative algorithm is guaranteed to converge due to the convexity of $Q$.
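
The role of $\alpha_{m,u,v}$ in the decomposition of Equation (17) can be illustrated with a small numerical check. For positive terms $a_u$ (standing in for $\mathcal{X}_{m,u,v} (\mathcal{Y}_{m,u,v})^{-1}$), the quantity $\sum_u \bar{\alpha}_u \log a_u - \sum_u \bar{\alpha}_u \log \bar{\alpha}_u$ never exceeds $\log \sum_u a_u$ and equals it when $\bar{\alpha}_u = a_u / \sum_z a_z$. The sketch below uses arbitrary positive numbers and hypothetical function names; it is not the full learning algorithm.

```python
import math


def responsibilities(a):
    """Normalize positive terms so that they sum to one (the alpha values)."""
    total = sum(a)
    return [x / total for x in a]


def lower_bound(a, weights):
    """sum_u w_u * log(a_u) - sum_u w_u * log(w_u), a lower bound on log(sum(a))."""
    return sum(w * math.log(x) - w * math.log(w) for x, w in zip(a, weights))


a = [0.2, 0.5, 1.3]                      # arbitrary positive terms
exact = math.log(sum(a))                 # the problematic log of a sum
tight = lower_bound(a, responsibilities(a))
loose = lower_bound(a, [1 / 3, 1 / 3, 1 / 3])
```

Because the bound is tight at the current estimates, increasing the $Q$ part cannot decrease the log-likelihood, which is the usual EM argument behind the monotonicity claim above.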

Appendix B. Learning Algorithm for AsLT model 

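An iterative parameter update algorithm similar to that for the AsIC model can be derived for the AsLT
model, too. We first define $\phi_{m,u,v}$ for each $v \in C_m$ and $u \in B_{m,v}$, $\varphi_{m,u,v}$ for each $v \in \partial C_m$ and
$u \in \{v\} \cup B(v) \setminus B_{m,v}$, and $\psi_{m,u,v}$ for each $v \in \partial C_m$ and $u \in B_{m,v}$, respectively, by the following
formulas:

$$\phi_{m,u,v} = \frac{q_{u,v} \exp(-r_v (t_{m,v} - t_{m,u}))}{\sum_{z \in B_{m,v}} q_{z,v} \exp(-r_v (t_{m,v} - t_{m,z}))},$$
$$\varphi_{m,u,v} = \frac{q_{u,v}}{\sum_{z \in \{v\} \cup B(v) \setminus B_{m,v}} q_{z,v} + \sum_{z \in B_{m,v}} q_{z,v} \exp(-r_v (T_m - t_{m,z}))},$$
$$\psi_{m,u,v} = \frac{q_{u,v} \exp(-r_v (T_m - t_{m,u}))}{\sum_{z \in \{v\} \cup B(v) \setminus B_{m,v}} q_{z,v} + \sum_{z \in B_{m,v}} q_{z,v} \exp(-r_v (T_m - t_{m,z}))}.$$

Let $\bar{r} = (\bar{r}_v)$ and $\bar{q} = (\bar{q}_{u,v})$ be the current estimates of $r$ and $q$, respectively. Similarly, let $\bar{\phi}_{m,u,v}$,
$\bar{\varphi}_{m,u,v}$ and $\bar{\psi}_{m,u,v}$ denote the values of $\phi_{m,u,v}$, $\varphi_{m,u,v}$ and $\psi_{m,u,v}$ calculated by using $\bar{r}$ and $\bar{q}$,
respectively.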






10. This can be easily verified using the Lagrange multipliers method with the constraint $\sum_{u \in B_{m,v}} \alpha_{m,u,v} = 1$.
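From Equations (9), (10) and (11), we can transform $\mathcal{L}(r, q; \mathcal{D}_M)$ as follows:

$$\log \mathcal{L}(r, q; \mathcal{D}_M) = Q(r, q; \bar{r}, \bar{q}) - H(r, q; \bar{r}, \bar{q}), \qquad (19)$$

where $Q(r, q; \bar{r}, \bar{q})$ is defined by

$$Q(r, q; \bar{r}, \bar{q}) = \sum_{m=1}^{M} \Big( \sum_{v \in C_m} Q^{(1)}_{m,v} + \sum_{v \in \partial C_m} Q^{(2)}_{m,v} \Big), \qquad (20)$$

$$Q^{(1)}_{m,v} = \sum_{u \in B_{m,v}} \bar{\phi}_{m,u,v} \log\big(q_{u,v}\, r_v \exp(-r_v (t_{m,v} - t_{m,u}))\big),$$

$$Q^{(2)}_{m,v} = \sum_{u \in \{v\} \cup B(v) \setminus B_{m,v}} \bar{\varphi}_{m,u,v} \log(q_{u,v}) + \sum_{u \in B_{m,v}} \bar{\psi}_{m,u,v} \log\big(q_{u,v} \exp(-r_v (T_m - t_{m,u}))\big).$$

It is easy to see that $Q(r, q; \bar{r}, \bar{q})$ is convex with respect to $r$ and $q$, and $H(r, q; \bar{r}, \bar{q})$ is defined by

$$H(r, q; \bar{r}, \bar{q}) = \sum_{m=1}^{M} \Big( \sum_{v \in C_m} \sum_{u \in B_{m,v}} \bar{\phi}_{m,u,v} \log \phi_{m,u,v} + \sum_{v \in \partial C_m} \Big( \sum_{u \in \{v\} \cup B(v) \setminus B_{m,v}} \bar{\varphi}_{m,u,v} \log \varphi_{m,u,v} + \sum_{u \in B_{m,v}} \bar{\psi}_{m,u,v} \log \psi_{m,u,v} \Big) \Big). \qquad (21)$$

Since $H(r, q; \bar{r}, \bar{q})$ is maximized at $r = \bar{r}$ and $q = \bar{q}$ from Equation (21), we can increase the
value of $\mathcal{L}(r, q; \mathcal{D}_M)$ by maximizing $Q(r, q; \bar{r}, \bar{q})$ (see Equation (19)).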

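Thus, we obtain the following update formula of our estimation method as the solution which
maximizes $Q(r, q; \bar{r}, \bar{q})$ with respect to $r$:

$$r_v = \frac{\sum_{m \in \mathcal{M}^{(1)}_v} \sum_{u \in B_{m,v}} \bar{\phi}_{m,u,v}}{\sum_{m \in \mathcal{M}^{(1)}_v} \sum_{u \in B_{m,v}} \bar{\phi}_{m,u,v} (t_{m,v} - t_{m,u}) + \sum_{m \in \mathcal{M}^{(2)}_v} \sum_{u \in B_{m,v}} \bar{\psi}_{m,u,v} (T_m - t_{m,u})},$$

where $\mathcal{M}^{(1)}_v$ and $\mathcal{M}^{(2)}_v$ are defined by

$$\mathcal{M}^{(1)}_v = \{m \in \{1, \cdots, M\};\ v \in C_m\},$$
$$\mathcal{M}^{(2)}_v = \{m \in \{1, \cdots, M\};\ v \in \partial C_m\}.$$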
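As for $q$, we have to take the constraints $q_{v,v} + \sum_{u \in B(v)} q_{u,v} = 1$ into account for each $v$, which can
easily be done using the Lagrange multipliers method, and we obtain the following update formulas
of our estimation method:

$$q_{u,v} \propto \sum_{m \in \mathcal{M}^{(1)}_{u,v}} \bar{\phi}_{m,u,v} + \sum_{m \in \mathcal{M}^{(2)}_{u,v}} \bar{\varphi}_{m,u,v} + \sum_{m \in \mathcal{M}^{(3)}_{u,v}} \bar{\psi}_{m,u,v},$$

where $\mathcal{M}^{(1)}_{u,v}$, $\mathcal{M}^{(2)}_{u,v}$ and $\mathcal{M}^{(3)}_{u,v}$ are defined by

$$\mathcal{M}^{(1)}_{u,v} = \{m \in \{1, \cdots, M\};\ v \in C_m,\ u \in B_{m,v}\},$$
$$\mathcal{M}^{(2)}_{u,v} = \{m \in \{1, \cdots, M\};\ v \in \partial C_m,\ u \in \{v\} \cup B(v) \setminus B_{m,v}\},$$
$$\mathcal{M}^{(3)}_{u,v} = \{m \in \{1, \cdots, M\};\ v \in \partial C_m,\ u \in B_{m,v}\}.$$

The actual values are obtained after normalization. Here again, the convergence is guaranteed.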
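
To illustrate the kind of closed-form update used for the AsLT delay parameter, the sketch below assumes that the $r_v$-dependent part of $Q$ has the form $\sum \bar{\phi} (\log r_v - r_v \Delta) - \sum \bar{\psi}\, r_v \Delta'$, where $\Delta = t_{m,v} - t_{m,u}$ for activated nodes and $\Delta' = T_m - t_{m,u}$ for nodes not yet activated at the end of observation. Setting the derivative to zero gives a ratio of weighted sums. All function names, weights and time differences below are hypothetical illustration values, and the sketch omits the recomputation of the weights that a full iteration would require.

```python
import math


def update_r(phi, dt_active, psi, dt_censored):
    """Closed-form maximizer of the assumed r-dependent part of Q."""
    numerator = sum(phi)
    denominator = (
        sum(w * d for w, d in zip(phi, dt_active))
        + sum(w * d for w, d in zip(psi, dt_censored))
    )
    return numerator / denominator


def q_r_part(r, phi, dt_active, psi, dt_censored):
    """Assumed r-dependent part of Q: activated terms plus not-yet-activated terms."""
    active = sum(w * (math.log(r) - r * d) for w, d in zip(phi, dt_active))
    censored = sum(-w * r * d for w, d in zip(psi, dt_censored))
    return active + censored


def normalize(weights):
    """Rescale non-negative weights so that they sum to one (the q update step)."""
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


phi, dt_active = [0.7, 0.3, 1.0], [1.0, 2.0, 0.5]
psi, dt_censored = [0.2, 0.1], [3.0, 4.0]
r_new = update_r(phi, dt_active, psi, dt_censored)
q_new = normalize({"v": 0.5, "u1": 1.7, "u2": 0.3})
```

Since the assumed objective is a sum of $\log r_v$ terms minus terms linear in $r_v$, it has a single stationary point, so the ratio above is its maximizer for fixed weights.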